From 2584645ec693dec51764b193bf152bb5b44ac8c5 Mon Sep 17 00:00:00 2001 From: Yad Konrad Date: Wed, 6 May 2026 17:25:43 -0400 Subject: [PATCH] =?UTF-8?q?release:=200.7.0=20=E2=80=94=20ETL=20pipeline?= =?UTF-8?q?=20(Waves=201-4)=20+=20handoff=20doc?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Bumps to 0.7.0. Consolidates the [Unreleased] CHANGELOG entries from the 11 ETL PRs (#72, #73, #74, #75, #76, #79, #81, #80, #78, #77, #82) into a single [0.7.0] section. New: docs/HANDOFF.md — state-of-the-codebase walkthrough for incoming agents. Architecture map, recent history, key gotchas, what's left, files-to-read-first. End-state on the maintainer's real store: 150,337 usage_events Marts populated and watermarks in sync Dashboard cold-load 2.5s → <50ms warm Watcher 155ms end-to-end source-file-write → dashboard-data-fresh 1598 backend tests passing, 2 skipped, 11 deselected (slow suite). Frontend typecheck + build clean. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 26 ++- docs/HANDOFF.md | 331 +++++++++++++++++++++++++++++++++ pyproject.toml | 2 +- stackunderflow-ui/package.json | 2 +- stackunderflow/__version__.py | 2 +- 5 files changed, 354 insertions(+), 9 deletions(-) create mode 100644 docs/HANDOFF.md diff --git a/CHANGELOG.md b/CHANGELOG.md index 83ee9fc..8504258 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,12 +7,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] -### Added -- **Wave 4D — 12 beta provider normalizers.** Codeium (stub), Continue, Copilot, Cursor Agent, Droid, Gemini, KiloCode, Kiro, OpenClaw, OpenCode, Pi/OMP, Qwen, Roo Code now have `Normalizer` subclasses registered at import time. ETL pipeline now covers all 16 providers from the codeburn catalog. Beta providers stay opt-in via the existing `STACKUNDERFLOW_BETA_*` env flags — when they're enabled the matching normalizer fires automatically. 
-- **Wave 4A — analytical routes migrate to mart reads (Wave 3B redo).** `compare_models()` reads `model_day_mart` + `session_mart`; `yield_tracker._query_sessions()` reads `session_mart`; `optimize._detect_cache_overhead()` reads `session_mart`; `/api/messages/summary` reads `project_mart`. Same JSON contract, ~30-50× faster on the user's real store. Empty-mart fallback to aggregator preserved per route. -- **Wave 4B — backfill actually populates `usage_events`.** `stackunderflow etl backfill` now reads every message from the `messages` table, runs the matching provider normalizer (Wave 2A), and inserts into `usage_events`; `--force` rebuilds from scratch. Idempotent via `uniq_events_msg` UNIQUE index. The ingest writer (`stackunderflow/ingest/writer.py`) gets a normalize+insert hook so newly-ingested messages auto-create events without needing a backfill pass. Marts auto-refresh via `refresh_all_marts()` after each batch. -- **Wave 4C — `/api/etl/status` + `stackunderflow etl status`.** Single endpoint surfaces watcher health, mart watermarks vs max event id, per-provider event counts, and a `health` enum (live/syncing/stale/error) so the dashboard can show a status badge and the CLI a one-line health check. <50ms response — all counts are indexed COUNT(*). -- **Wave 4F — ETL status badge in the dashboard header + Settings backfill button.** New `EtlStatusBadge` polls `/api/etl/status` every 10s and shows live/syncing/stale/error health with a click-through popover detailing per-mart watermarks, per-provider event counts, and watcher state. Settings page gains an "ETL pipeline" section with a "Backfill now" button — POSTs to `/api/etl/backfill` when available, else shows the equivalent CLI command. +## [0.7.0] - 2026-05-06 + +### Added — ETL pipeline (Waves 1-4) + +The dashboard's per-request aggregator passes are gone. 
Every cost / dashboard / compare / yield / optimize / messages-summary route now reads from indexed materialized marts; the watcher syncs the marts within ~400ms of any source-file write. End-to-end on a 247K-message store: dashboard cold-load went from 2.5s to <50ms warm. The pipeline is three layers — raw `messages` → normalized `usage_events` → 5 materialized marts — wired together by a debounced filesystem watcher and a watermarked refresh loop. See `docs/specs/etl-architecture.md` for the design contract and `docs/HANDOFF.md` for a state-of-the-codebase walkthrough. + +- **Wave 1 — foundation (`stackunderflow/etl/`).** Migration v006 adds 7 tables: `usage_events` (canonical fact table, one row per billable event with `source_message_fk` UNIQUE for idempotent re-runs), `daily_mart`, `session_mart`, `project_mart`, `provider_day_mart`, `model_day_mart`, plus `mart_watermark` to track per-mart `last_event_id`. New `Normalizer` ABC + `MartBuilder` ABC + last-wins registries (`register/get/all`); `etl/watermark.py` (`get_watermark`, `set_watermark`, `refresh_all_marts`); `etl/backfill.py` (`BackfillReport` dataclass + idempotent orchestrator). Migration is additive — existing tables and routes keep working. +- **Wave 2A — 4 default-on provider normalizers.** `claude`, `codex`, `cursor`, `cline` `Normalizer` subclasses transform `messages` rows into canonical `usage_events`. Codex token normalization (subtract cached, fold reasoning) moves into `CodexNormalizer` — single source of truth, no more drift between adapter + pricer. Cursor v3 no-per-message-tokens estimates from `len(text)//4` and stamps `cost_source='estimated'`. `cost_usd` computed once per event, stored on the row, never recomputed downstream. +- **Wave 2B — 5 mart builders.** Indexed read-side rollups derived from `usage_events`. 
Each builder is watermarked + idempotent: incremental refresh from `last_event_id` for additive marts (daily, provider_day, model_day) with a follow-up DISTINCT recompute pass for `session_count` correctness across windows; replace-from-scratch-for-affected-keys for per-entity marts (session, project) so totals stay correct when new events arrive for an existing session. `MartBuilder.rebuild_from_scratch()` for full-backfill recovery. +- **Wave 2C — filesystem watcher.** `watchfiles`-backed daemon thread that watches every registered adapter's source paths. On any change → `adapter.read(since=watermark)` → normalize → `usage_events` → `refresh_all_marts`. Debounced 200ms to coalesce JSONL append bursts. End-to-end latency **155 ms** smoke-tested against `~/.claude/projects` on the maintainer's machine (under the 400 ms target). New `BaseAdapter.watch_paths()` method (default `[]`); claude/codex/cursor/cline adapters return their canonical roots. `stackunderflow start --no-watcher` flag for headless mode; `STACKUNDERFLOW_DISABLE_WATCHER=1` env var equivalent. +- **Wave 3A — hot-path routes migrate to mart reads.** `/api/projects?include_stats=true`, `/api/dashboard-data`, `/api/cost-data` (totals/by_day/by_model blocks), and `/api/cost-data/by-provider` now read from `project_mart` + `daily_mart` + `provider_day_mart` instead of running per-request aggregator passes. Same JSON contract; ~50× faster on a 28K-message project (cold 2.5–2.8s → 50ms warm). Per-session / per-command / per-tool detail blocks stay on the aggregator path until lower-grain marts ship in a future wave. +- **Wave 4A — analytical routes migrate to mart reads.** `compare_models()` reads `model_day_mart` + `session_mart`; `yield_tracker._query_sessions()` reads `session_mart`; `optimize._detect_cache_overhead()` reads `session_mart`; `/api/messages/summary` reads `project_mart`. Same JSON contract, ~30-50× faster on real data. 
Empty-mart fallback to aggregator preserved per route so the contract holds even before backfill runs. +- **Wave 4B — backfill streams `messages → usage_events`; ingest writer auto-normalizes.** `stackunderflow etl backfill` now reads every existing message via id-paginated chunks (5K-row transactions, never `fetchall()`), runs the matching provider normalizer, and inserts into `usage_events`; `--force` drops + rebuilds; idempotent via `uniq_events_msg`. The ingest writer hooks the same normalize-and-insert helper so newly-ingested messages auto-create events. After every backfill pass and every writer batch, `refresh_all_marts(conn)` runs to bring the marts forward. Smoke test on the maintainer's real ~250K-message store: first pass inserted **150,337 events** in 226 s (2nd run: 0 inserted, 29 s, fully idempotent). Marts populated: daily=940, session=841, project=151, provider_day=146, model_day=184. `SUM(daily_mart.cost_usd)` == `SUM(usage_events.cost_usd)` to the cent. +- **Wave 4C — `/api/etl/status` + `stackunderflow etl status`.** Single endpoint surfaces watcher health, mart watermarks vs max event id, per-provider event counts, per-cost-source breakdown, and a `health` enum (`live` / `syncing` / `stale` / `error`). <50 ms response. Shared `etl/status.py` assembler keeps SQL out of the CLI; the route + CLI never drift. Watcher state degrades to `running="unknown"` when no live handle (CLI mode, headless) instead of crashing. +- **Wave 4D — 12 beta provider normalizers.** Codeium (stub), Continue, Copilot, Cursor Agent, Droid, Gemini, KiloCode, Kiro, OpenClaw, OpenCode, Pi/OMP, Qwen, Roo Code now have `Normalizer` subclasses registered at import time. ETL pipeline now covers all 16 providers from the codeburn catalog. Beta providers stay opt-in via the existing `STACKUNDERFLOW_BETA_*` env flags — when they're enabled the matching normalizer fires automatically. 
Gemini + Qwen apply the canonical cached-subtraction + reasoning-fold rule that mirrors the OpenAI shape. +- **Wave 4E — real-data ETL pipeline e2e + per-route latency regression suite.** New `tests/stackunderflow/integration/` package with two slow-marker test files: `test_etl_pipeline_e2e.py` builds a 10K-message synthetic store across 5 providers, runs backfill, validates every mart sums correctly, then hits every dashboard route asserting 200 + non-empty + <500 ms. `test_route_perf_regression.py` parametrises every dashboard route against a 100K-row synthetic marts fixture with explicit per-route latency budgets — fails CI if any route regresses. Run with `pytest -m slow`. New `[tool.pytest.ini_options]` registers the `slow` marker and adds `addopts = "-m 'not slow'"` so default `pytest tests/ -q` keeps the fast feedback loop. Latency table from the regression suite (dev box, 100K mart rows): `/api/projects` 5.8 ms, `/api/dashboard-data` 7.1 ms, `/api/cost-data` 11.9 ms, `/api/cost-data/by-provider` 1.1 ms, `/api/compare` 1.7 ms, `/api/yield` 1.2 ms, `/api/optimize` 100 ms, `/api/messages/summary` 1.6 ms. +- **Wave 4F — ETL status badge in the dashboard header + Settings backfill button.** New `EtlStatusBadge` polls `/api/etl/status` every 10s and shows live/syncing/stale/error health with a click-through popover detailing per-mart watermarks, per-provider event counts, and watcher state. Settings page gains an "ETL pipeline" section with a "Backfill now" button — POSTs to `/api/etl/backfill` when available, else shows the equivalent CLI command. Bundle delta: +14 KB raw / +5 KB gzipped. + +### Test count +1472 (start of Wave 4) → **1598 passing, 2 skipped, 11 deselected** (+126 net from Wave 4 work; ETL-specific tests +250 cumulative since start of pipeline work). 
- **Wave 4E — real-data ETL pipeline + per-route latency regression suite.** New `tests/stackunderflow/integration/` package with two slow-marker test files: `test_etl_pipeline_e2e.py` builds a 10K-message synthetic store across 5 providers, runs backfill, validates every mart sums correctly, then hits every dashboard route asserting 200 + non-empty + <500ms. `test_route_perf_regression.py` parametrises every dashboard route against a 100K-row synthetic marts fixture with explicit per-route latency budgets — fails CI if any route regresses. Run with `pytest -m slow`. New `[tool.pytest.ini_options]` section in `pyproject.toml` registers the `slow` marker and adds `addopts = "-m 'not slow'"` so the default `pytest tests/ -q` run keeps the fast feedback loop (slow tests are opt-in). - **Wave 3A — hot-path routes migrate to mart reads.** `/api/projects?include_stats=true`, `/api/dashboard-data`, and `/api/cost-data` (totals/by_day/by_model blocks) now read from `project_mart` + `daily_mart` instead of running per-request aggregator passes against raw `messages`. Same JSON contract; ~50× faster on the user's 28K-message project (cold 2.5–2.8s → 50ms warm). Per-session / per-command / per-tool detail blocks stay on the aggregator path until lower-grain marts ship in Wave 4. - **ETL foundation: usage_events fact table + 5 marts + watermarks + backfill orchestrator (Wave 1).** Lays the schema and base classes; Waves 2 (normalizers + mart builders + watcher) and 3 (route migrations) fill in the bodies. 
Migration v006 (the spec called it v004, but v004/v005 were taken by the synthetic-models cleanup and cursor-workspace redistribute — the migration file is renumbered to v006 and the spec doc is updated to match) adds 7 tables (`usage_events`, `daily_mart`, `session_mart`, `project_mart`, `provider_day_mart`, `model_day_mart`, `mart_watermark`) plus indexes (`idx_events_day`, `idx_events_project`, `idx_events_provider`, `idx_events_session`, `idx_events_model`, `uniq_events_msg` UNIQUE on `source_message_fk`, `idx_daily_mart_project`, `idx_session_mart_project`, `idx_session_mart_first`, `idx_provider_day_mart_day`). New `stackunderflow.etl` package: `normalize/base.py` (`Normalizer` ABC) + `normalize/__init__.py` (last-wins `register/get/all` registry), `marts/base.py` (`MartBuilder` ABC with abstract `refresh(conn, since_event_id) -> int` and concrete no-op `rebuild_from_scratch`) + `marts/__init__.py` (last-wins registry), `watermark.py` (`get_watermark` returns 0 on missing, `set_watermark` upserts with UTC ISO8601 `last_refresh_ts`, `refresh_all_marts` iterates the marts registry and persists each mart's new watermark), and `backfill.py` (`BackfillReport` dataclass with `events_inserted`, `events_skipped_duplicate`, `marts_refreshed: dict[str, int]`, `duration_seconds`; `backfill(conn, *, force=False)` orchestrator skeleton — empty-registry no-op until Wave 2 lands, `force=True` empties events + marts + watermarks). New CLI: `stackunderflow etl backfill [--force]` (no-op until normalizers register in Wave 2; reports zero counts). Migration is **additive** — does not touch existing `messages`/`sessions`/`projects` tables, all existing routes keep working unchanged. 
39 new tests across `tests/stackunderflow/store/test_migration_v006.py` (12: tables exist, columns/PKs per table, indexes present, UNIQUE on `uniq_events_msg`, idempotent re-apply), `tests/stackunderflow/etl/test_registries.py` (7: register/get/all, copy semantics, last-wins overwrite for both registries), `tests/stackunderflow/etl/test_watermark.py` (9: missing→0, set/get round-trip, overwrite, ts stamping, per-mart independence, empty-registry refresh, advance + idempotent + pickup-from-existing-watermark), `tests/stackunderflow/etl/test_backfill.py` (7: empty-store report shape, idempotent re-run, `force=True` drops events + marts + watermarks, `force=True` idempotent, mart refresh runs even with empty normalizers, BackfillReport field-set is locked). Spec at `docs/specs/etl-architecture.md`. diff --git a/docs/HANDOFF.md b/docs/HANDOFF.md new file mode 100644 index 0000000..7ae52d4 --- /dev/null +++ b/docs/HANDOFF.md @@ -0,0 +1,331 @@ +# StackUnderflow — Handoff doc + +**Date:** 2026-05-06 (post v0.7.0 release) +**Maintainer:** yad.konrad@quantumrise.com / 0bserver07 +**Branch:** `main` (clean), tag `v0.7.0` +**Tests:** 1598 passing, 2 skipped, 11 deselected (`pytest -m slow` runs the 11) +**Frontend:** typecheck + build clean, `stackunderflow-ui@0.7.0` + +This doc gets a fresh agent oriented in 10 minutes. Read it before reading code. + +--- + +## What StackUnderflow is + +A local-first knowledge base + cost dashboard for AI coding sessions. Forked from a since-rewritten codebase; **MIT, no external service dependencies, no telemetry**. + +The user runs `stackunderflow start`. A FastAPI server binds `127.0.0.1:8095`, serves a React dashboard, and exposes: + +- A **REST API** under `/api/*` for the dashboard +- An **MCP server** (over stdio) so Claude Desktop / Cursor / Claude Code can query session history without spinning up the dashboard +- A **CLI** (`stackunderflow ...`) for ops, exports, plan budgets, ETL ops, etc. 
+- A **Python public API** (`import stackunderflow; list_projects(); process(slug)`) for scripting + +Source-of-truth state lives at `~/.stackunderflow/store.db` (SQLite). The dashboard is **read-only against the store** in the hot path; ingest happens in the background. + +--- + +## Architecture map + +``` +┌──────────────────────── Source files (16 providers) ────────────────────────┐ +│ ~/.claude/projects/ # JSONL │ +│ ~/.codex/sessions/ # JSONL │ +│ ~/Library/.../Cursor/.../state.vscdb # SQLite │ +│ ~/Library/.../saoudrizwan.claude-dev # JSON (Cline) │ +│ ~/.gemini/, ~/.qwen/, ~/.factory/, ... # 12 beta providers │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ Adapter (per-provider parser) +┌────────────────────────── RAW LAYER ──────────────────────────────────────┐ +│ messages, sessions, projects (SQLite) │ +│ one row per source-message; immutable; UNIQUE(provider, slug) │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ Normalizer (per-provider transform) +┌────────────────────── NORMALIZED LAYER ───────────────────────────────────┐ +│ usage_events │ +│ one row per billable event, canonical shape, cost_usd computed once │ +│ cost_source: live | rate_card | estimated | unknown │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ MartBuilder.refresh(conn, since_event_id) +┌──────────────────────── MARTS LAYER ──────────────────────────────────────┐ +│ daily_mart (day, project_id, provider, model, speed) │ +│ session_mart (session_id, all per-session aggregates) │ +│ project_mart (project_id, lifetime totals) │ +│ provider_day_mart (day, provider) │ +│ model_day_mart (day, model, speed) │ +│ mart_watermark (mart_name → last_event_id, last_refresh_ts) │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ + REST routes — plain SELECTs from marts only +``` + +The watcher (`stackunderflow/etl/watcher.py`) ties 
the layers together: filesystem change → adapter.read() → writer inserts messages → normalizer inserts events → refresh_all_marts() advances watermarks. End-to-end target is under 400 ms (155 ms measured on the maintainer's machine). + +--- + +## Package layout + +``` +stackunderflow/ + adapters/ # Per-provider source parsers (16 of them; 4 default-on) + base.py # SourceAdapter Protocol; SessionRef + Record dataclasses + claude.py codex.py cursor.py cline.py # default-on + cursor_agent.py opencode.py qwen.py gemini.py # beta + copilot.py codeium.py continue_adapter.py # beta + droid.py kiro.py openclaw.py pi.py # beta + kilocode.py roocode.py # beta (cline-family) + _streaming.py # 128 MB cap + 8 MB stream threshold for JSONL + api/ # Public Python API surface (list_projects/process/list_sessions) + etl/ ← NEW (v0.7) + normalize/ # Per-provider transforms messages → usage_events + base.py # Normalizer ABC + cost_source constants + _build_event helper + __init__.py # last-wins registry: register/get/all + 16 normalizers wire here + claude.py codex.py cursor.py cline.py # default-on + <12 beta normalizers> + marts/ # MartBuilder ABC + 5 builders (daily, session, project, provider_day, model_day) + base.py # ABC; concrete rebuild_from_scratch default + __init__.py # last-wins registry; 5 builders wire here + daily.py session.py project.py provider_day.py model_day.py + backfill.py # Streams messages → events → marts; idempotent; --force rebuild + watcher.py # watchfiles daemon; debounced 200 ms; per-adapter dispatch + watermark.py # get/set/refresh_all_marts; persists last_event_id + last_refresh_ts + status.py # Shared assembler for /api/etl/status + `stackunderflow etl status` + ingest/ + writer.py # INSERT INTO messages + normalize+insert hook (Wave 4B) + enumerate.py # Discovery wrapper around all registered adapters + __init__.py # run_ingest(conn, adapters) + infra/ + costs.py # compute_cost(tokens, model, provider, *, speed) → dict + currency.py # Frankfurter live + 24h cache + ECB snapshot fallback + cursor_cache.py # 
Fingerprint cache for vscdb (3-8× cold-start speedup) + discovery.py # Filesystem scan helpers (legacy file-scan path) + providers/ # Per-provider Pricers (anthropic, openai, cursor, etc.) + mcp/ + server.py # FastMCP server; reads from store; 3 tools (session_query, list_sessions, list_projects) + store_reader.py # Read-only store helpers shared with the MCP server + reports/ # CLI report renderers (text/json/csv) + optimize patterns + routes/ # FastAPI routes (one file per concern, 14 of them) + cfg.py compare.py context_budget.py cost.py data.py etl.py + export.py optimize.py plan.py projects.py sessions.py yield_route.py + bookmarks.py commands.py misc.py qa.py search.py tags.py + services/ # compare, plans, yield_tracker, pricing, search, qa, tags, bookmarks + store/ + schema.py # CURRENT_VERSION = 6; applies SQL + .py migrations idempotently + queries.py # Typed query helpers (one place for all SQL) + mart_queries.py # Read helpers used by route migrations (Wave 3A/4A) + db.py types.py + migrations/ # v001 → v006 (v005 is .py, rest are .sql) + cli.py server.py deps.py settings.py __version__.py + +stackunderflow-ui/ # React dashboard (Vite) + src/ + pages/ # Overview, ProjectDashboard, Settings + components/ + common/ FilterBar, EtlStatusBadge, ExportButton, ... + dashboard/ one Tab per top-level view (Overview/Sessions/Cost/Compare/Yield/...) + cost/ # Cost-tab widgets including CostByProviderCard + analytics/, charts/, layout/, qa/ + services/ # API client + format/currency/filters/providerStyle helpers + types/api.ts # Backend response shapes mirrored as TypeScript + +tests/ # 1598 backend tests; integration/ has the slow-marker e2e + perf +docs/ + HANDOFF.md # This file + specs/ # Architecture specs (multi-provider, etl, etc.) + cli-reference.md api-reference.md multi-provider.md mcp.md ... 
+``` + +--- + +## Recent history (v0.5 → v0.7) + +| Tag | Date | Highlights | +|---|---|---| +| v0.5.0 | 2026-04-30 | All 16 codeburn-catalog providers as adapters; 4 default-on (claude/codex/cursor/cline) + 12 beta-flag-gated | +| v0.6.0 | 2026-05-01 | Currency, export, model aliases, plan budgets, compare, yield, optimize patterns, context budget, fast-mode SQLite, streaming reader, cursor cache. Multi-provider Python API + MCP. Cursor v3 conversationId fix. UI surfaces wired | +| v0.6.1 | 2026-05-01 | Currency snapshot fallback, cursor pricing for `composer-*`, per-workspace cursor slugs, `` cleanup, defensive adapter coverage | +| v0.6.x patches | 2026-05-04 to 2026-05-05 | Provider/model FilterBar URL-synced, `formatModelName` normalizer, `Annotated[..., Query()]` filter binding fix, non-blocking startup ingest, `bulk_*` SQL helpers replacing N+1 in `/api/projects` | +| **v0.7.0** | **2026-05-06** | **ETL pipeline (Waves 1–4): usage_events + 5 marts + watermarked refresh + filesystem watcher + every dashboard route migrated to mart reads + status surface + UI badge** | + +--- + +## What changed in v0.7.0 (the ETL push) + +### New tables (migration `v006_etl_layer.sql`) +- `usage_events` (canonical fact, `UNIQUE(source_message_fk)` for dedup) +- `daily_mart`, `session_mart`, `project_mart`, `provider_day_mart`, `model_day_mart` +- `mart_watermark` + +### New abstractions +- `Normalizer` ABC + 16 subclasses (per-provider transforms) +- `MartBuilder` ABC + 5 subclasses (per-mart rollup logic) +- Two last-wins registries (`stackunderflow.etl.normalize` and `stackunderflow.etl.marts`) +- Watermark helpers (`get_watermark`, `set_watermark`, `refresh_all_marts`) +- `BackfillReport` dataclass + `backfill(conn, *, force=False)` orchestrator + +### New surfaces +- `GET /api/etl/status` returning `{watcher, marts, events, lag_seconds, health}` +- `stackunderflow etl status [--format text|json]` CLI +- `stackunderflow etl backfill [--force]` CLI (now actually populates 
events) +- `EtlStatusBadge` in the dashboard header +- "ETL pipeline" section on `/settings` with "Backfill now" button + +### Routes migrated to mart reads +- `/api/projects?include_stats=true` +- `/api/dashboard-data` +- `/api/cost-data` (totals/by_day/by_model blocks) +- `/api/cost-data/by-provider` +- `/api/compare` +- `/api/yield` +- `/api/optimize` (cache_overhead detector only — others stay on aggregator path because they need per-message text) +- `/api/messages/summary` + +Empty-mart fallback to the aggregator is preserved per route, so the JSON contract is unchanged whether marts are populated or empty. + +### Latency on real data +- 247K-message store; before: dashboard cold-load took 2.5–2.8 s +- After: per-route warm latencies range from 1.1 ms (cost-by-provider) to 100 ms (optimize). Median <10 ms. + +### Watcher +- ~155 ms end-to-end smoke-tested on the maintainer's `~/.claude/projects` (well under the 400 ms target) +- `stackunderflow start` now non-blocking — HTTP binds in <1 s, ingest runs in a daemon thread + +--- + +## Key gotchas + design decisions + +### Migration numbering +Spec called the ETL migration v004; v004 + v005 were already taken (synthetic-models cleanup + cursor-workspace redistribute). Final file is `v006_etl_layer.sql`. `schema.CURRENT_VERSION = 6`. Migration is **additive** — no existing tables touched. + +### Empty-mart fallback +Every migrated route checks if its mart is populated. If yes → mart read. If no → original aggregator path. So the dashboard works even on a fresh install before backfill runs. After Wave 4B's backfill or a single watcher cycle, marts populate and the fast path takes over automatically. + +### Cost is computed once +`cost_usd` lives on every `usage_events` row. Marts SUM it, never re-apply rate cards. Currency conversion stays at the API boundary (already correct from v0.6.0). When pricing changes, re-normalize from raw messages — one code path. 
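The empty-mart fallback and the compute-once rule above can be illustrated with a small sketch. This is not the project's code: the two-table schema is a simplified assumption, and `mart_is_populated`, `refresh_project_mart`, and `project_totals` are hypothetical helper names. For brevity the fallback branch sums stored `cost_usd` from `usage_events`, whereas the real aggregator path works on raw `messages`.

```python
import sqlite3

# Simplified, assumed schema: the real tables carry many more columns.
SCHEMA = """
CREATE TABLE usage_events (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL,
    cost_usd REAL NOT NULL          -- computed once, at normalization time
);
CREATE TABLE project_mart (
    project_id INTEGER PRIMARY KEY,
    cost_usd REAL NOT NULL,
    event_count INTEGER NOT NULL
);
"""


def mart_is_populated(conn: sqlite3.Connection) -> bool:
    """Hypothetical check: the mart counts as populated if it has any row."""
    return conn.execute("SELECT 1 FROM project_mart LIMIT 1").fetchone() is not None


def refresh_project_mart(conn: sqlite3.Connection) -> None:
    """Rebuild per-project totals by summing the stored per-event cost."""
    conn.execute(
        """
        INSERT OR REPLACE INTO project_mart (project_id, cost_usd, event_count)
        SELECT project_id, SUM(cost_usd), COUNT(*)
        FROM usage_events
        GROUP BY project_id
        """
    )


def project_totals(conn: sqlite3.Connection, project_id: int) -> dict:
    """Return the same shape from either path; only `source` differs."""
    if mart_is_populated(conn):
        row = conn.execute(
            "SELECT cost_usd, event_count FROM project_mart WHERE project_id = ?",
            (project_id,),
        ).fetchone()
        cost, count = row if row else (0.0, 0)
        source = "mart"
    else:
        # Stand-in for the slower aggregator path used before any backfill.
        cost, count = conn.execute(
            "SELECT COALESCE(SUM(cost_usd), 0.0), COUNT(*) "
            "FROM usage_events WHERE project_id = ?",
            (project_id,),
        ).fetchone()
        source = "aggregator"
    return {"cost_usd": cost, "event_count": count, "source": source}


def demo() -> tuple[dict, dict]:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO usage_events (project_id, cost_usd) VALUES (?, ?)",
        [(1, 0.25), (1, 0.5), (2, 0.125)],
    )
    before = project_totals(conn, 1)   # mart empty: fallback path
    refresh_project_mart(conn)
    after = project_totals(conn, 1)    # mart populated: fast path
    return before, after
```

Calling `demo()` returns the totals for project 1 before and after the mart is built. Both carry the same `cost_usd` and `event_count`, which is the property the per-route fallback is meant to preserve.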
+ +### `session_count` correctness across windows +Additive marts (daily, provider_day, model_day) can't simply SUM `COUNT(DISTINCT session_id)` across refresh windows (the same session can appear in two windows). Solution: after the additive INSERT...ON CONFLICT, a follow-up UPDATE recomputes `session_count` from `usage_events` for affected keys. Bounded by number of distinct keys in the window — typically O(1)..O(few dozen). Tests lock this in. + +### Per-entity vs additive marts +- `session_mart` and `project_mart` use INSERT OR REPLACE over a re-aggregated subquery for affected entities (totals stay correct when new events arrive for an existing session). +- `daily_mart`, `provider_day_mart`, `model_day_mart` use INSERT...ON CONFLICT DO UPDATE additively (because the same `(day, …)` key never appears in two refresh windows once the watermark moves forward). + +### Normalizer registry is in `__init__.py` +Per spec — Wave 1 puts the registry in `stackunderflow/etl/normalize/__init__.py`, NOT in `base.py`. Last-wins (re-registering overwrites). `_clear()` for tests. The 16 default registrations happen at package-import time via top-level `register("name", Cls)` calls. + +### Watcher +Uses `watchfiles` (Rust-backed). Daemon thread spawned in lifespan. Catches every exception so a bad event never poisons the loop. `--no-watcher` / `STACKUNDERFLOW_DISABLE_WATCHER=1` for headless mode. 
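The `session_count` rule described above (additive upsert, then a recompute for affected keys) can be sketched as follows. This is a minimal illustration under assumed names: the trimmed-down `usage_events` and `provider_day_mart` columns and the `refresh_provider_day` function are hypothetical, not the shipped builder. It relies on SQLite's `INSERT ... ON CONFLICT DO UPDATE`, which needs SQLite 3.24 or newer.

```python
import sqlite3

# Assumed, trimmed-down schema for illustration only.
SCHEMA = """
CREATE TABLE usage_events (
    id INTEGER PRIMARY KEY,
    day TEXT NOT NULL,
    provider TEXT NOT NULL,
    session_id TEXT NOT NULL,
    cost_usd REAL NOT NULL
);
CREATE TABLE provider_day_mart (
    day TEXT NOT NULL,
    provider TEXT NOT NULL,
    cost_usd REAL NOT NULL,
    session_count INTEGER NOT NULL,
    PRIMARY KEY (day, provider)
);
"""

# Step 1: additive upsert of the new window (events with id > watermark).
UPSERT = """
INSERT INTO provider_day_mart (day, provider, cost_usd, session_count)
SELECT day, provider, SUM(cost_usd), 0
FROM usage_events
WHERE id > ?
GROUP BY day, provider
ON CONFLICT(day, provider) DO UPDATE
SET cost_usd = provider_day_mart.cost_usd + excluded.cost_usd
"""

# Step 2: recompute session_count from usage_events, but only for keys
# that the new window touched. Distinct counts cannot be added across windows.
RECOUNT = """
UPDATE provider_day_mart
SET session_count = (
    SELECT COUNT(DISTINCT e.session_id)
    FROM usage_events e
    WHERE e.day = provider_day_mart.day
      AND e.provider = provider_day_mart.provider
)
WHERE EXISTS (
    SELECT 1 FROM usage_events w
    WHERE w.id > ?
      AND w.day = provider_day_mart.day
      AND w.provider = provider_day_mart.provider
)
"""


def refresh_provider_day(conn: sqlite3.Connection, since_event_id: int) -> int:
    """Fold events newer than the watermark into the mart; return the new watermark."""
    conn.execute(UPSERT, (since_event_id,))
    conn.execute(RECOUNT, (since_event_id,))
    (new_watermark,) = conn.execute(
        "SELECT COALESCE(MAX(id), ?) FROM usage_events", (since_event_id,)
    ).fetchone()
    return new_watermark


def demo() -> tuple[int, float, int]:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    insert = (
        "INSERT INTO usage_events (day, provider, session_id, cost_usd) "
        "VALUES (?, ?, ?, ?)"
    )
    # Window 1: two events from session s1.
    conn.executemany(insert, [("2026-05-06", "claude", "s1", 0.25)] * 2)
    watermark = refresh_provider_day(conn, 0)
    # Window 2: s1 shows up again, plus a new session s2.
    conn.executemany(
        insert,
        [("2026-05-06", "claude", "s1", 0.5), ("2026-05-06", "claude", "s2", 0.5)],
    )
    watermark = refresh_provider_day(conn, watermark)
    cost, sessions = conn.execute(
        "SELECT cost_usd, session_count FROM provider_day_mart"
    ).fetchone()
    return watermark, cost, sessions
```

In the demo, session `s1` appears in both refresh windows. Adding per-window distinct counts would report three sessions for the day; the recompute step reports two, while `cost_usd` is still accumulated additively.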
+ +--- + +## How to run / what to know + +```bash +# Run the dashboard +stackunderflow start # binds 127.0.0.1:8095 + # ingest + watcher run in background + +# ETL ops +stackunderflow etl status # health + watermarks +stackunderflow etl status --format json +stackunderflow etl backfill # incremental (skips already-converted msgs) +stackunderflow etl backfill --force # drops events + marts, rebuilds + +# Tests +pytest tests/ -q # 1598 fast tests (default) +pytest -m slow tests/stackunderflow/integration -q # the 11 slow tests +ruff check stackunderflow/ + +# Frontend +cd stackunderflow-ui +npm run typecheck +npm run build # output → ../stackunderflow/static/react/ +node --test tests/services/*.test.ts # frontend unit tests (Node built-in runner; no vitest dep) +``` + +--- + +## Real-data state right now (maintainer's machine) + +``` +~/.stackunderflow/store.db (1.9 GB): + 150,337 usage_events + Marts: daily=940, session=841, project=151, provider_day=146, model_day=184 + Watermarks all at 150,337 (in sync) + Per-provider events: claude 150,014, cursor 220, cline 103 +``` + +--- + +## What's left / known follow-ups + +| # | Item | Severity | +|---|---|---| +| 1 | `optimize` patterns that stay on aggregator path (bash_output_limits, junk_reads, low_read_edit_ratio, ghost_agents, etc.) need lower-grain marts (`tool_mart`, `command_mart`) — those marts are not built. They run against `messages` table directly, but only on the optimize endpoint, so it's slow but bounded. | low (fast enough) | +| 2 | The `/api/etl/backfill` POST route doesn't exist yet — Wave 4F's backfill button shows the equivalent CLI command on 404. Ship a thin route that wraps `etl.backfill.backfill(conn)` in a background task. | medium (UX) | +| 3 | Beta normalizers (12 of them) are wired but most haven't been validated against real local data on the maintainer's machine — only claude / codex / cursor / cline / gemini / droid / qwen have actual data. 
The Cursor v3 bug from v0.6.0 (`#52`) is the kind of latent failure to expect. The defensive empty-source/malformed-data tests added in v0.6.1 cover the failure modes but not full real-data parity. | medium (correctness on enabled betas) |
+| 4 | The per-route latency target on `/api/optimize` is 100 ms warm, 200 ms budget. It currently passes, but the margin is tight. As the 7 patterns grow, this will need either lower-grain marts (#1 above) or pattern-specific caching. | medium |
+| 5 | `messages_YYYYMM` partitioning was designed in the spec but not implemented; the `messages` table stays unpartitioned. On long-lived stores (years of data) this will eventually need to ship. | low (future) |
+| 6 | The `tool_mart` / `command_mart` lower-grain marts (deferred from Wave 3A/4A) are needed to migrate the per-session/per-command/per-tool detail blocks of `/api/cost-data`. Currently those blocks read raw messages. | low (current path works, just slower) |
+| 7 | The Wave 2C watcher has been verified only on macOS. `watchfiles` claims cross-platform parity, but Linux and Windows haven't been smoke-tested on real data. | low (most users on macOS) |
+| 8 | The watcher restarts on every `stackunderflow start`; there's no cross-process coordination. If two `start` invocations run, both will spin up watchers. The lifespan binds to a single process, so this is theoretical, but it is worth a lock file someday. | low |
+
+---
+
+## Files an incoming agent should read first
+
+1. `docs/specs/etl-architecture.md` — design contract for the pipeline
+2. `stackunderflow/etl/normalize/base.py` — `Normalizer` ABC + helpers
+3. `stackunderflow/etl/marts/base.py` — `MartBuilder` ABC
+4. `stackunderflow/etl/backfill.py` — orchestrator + writer hook
+5. `stackunderflow/etl/watcher.py` — watchfiles + per-adapter dispatch
+6. `stackunderflow/store/migrations/v006_etl_layer.sql` — schema
+7. `stackunderflow/store/mart_queries.py` — every read helper used by routes
+8. Any `routes/*.py` for the JSON contracts the dashboard depends on
+9. `tests/stackunderflow/integration/` — e2e + perf regression; the most useful single place to understand the whole pipeline at once
+
+---
+
+## Conventions worth knowing
+
+- **No version bumps without `CHANGELOG.md` + git tag + GitHub release** — done together as one PR (`release: 0.7.0`)
+- **No codeburn attribution** in shipped code (the project is a clean rewrite; references stayed in `docs/specs/multi-provider/` only)
+- **No backwards-compat shims** — when an API shape changes, change the consumers in the same PR
+- **Tests must run on Linux CI** (no macOS-only paths in non-platform-specific tests)
+- **Beta adapters opt in via `STACKUNDERFLOW_BETA_=1`** — never on by default
+- **Frontend tests use `node --test`** (Node 22+ built-in runner), not vitest — no new dev dep
+- **Idempotent EVERYTHING in ETL** — every `refresh`, `backfill`, `watcher cycle` must be safe to re-run
+- **The user's `~/.stackunderflow/store.db` is sacred** — tests use `tmp_path` or `:memory:`, never the real store
+- **Settings file:** `~/.stackunderflow/config.json` (not `settings.json`); the descriptor pattern in `settings.py` resolves env → file → default
+
+---
+
+## When something breaks
+
+| Symptom | Likely cause | Where to look |
+|---|---|---|
+| `/api/etl/status` shows lag > 1000 events for minutes | Watcher not running, or normalizer raising | `stackunderflow start` log; `stackunderflow etl status` |
+| Marts empty after backfill | Normalizer for that provider not registered | `stackunderflow.etl.normalize.all()` should list 16 keys |
+| Dashboard cost = $0 for a provider | Pricer for the model returns `None` | `infra/providers/.py` `rates_for()` |
+| Watcher log spammed with "Adapter raised in cycle" | Provider adapter has a parse bug | Look at the provider's adapter; run `adapter.read()` directly on the file |
+| `health: error` in status | Mart watermark stuck + watcher dead | Restart server; `stackunderflow etl backfill --force` if persistent |
+| New install, dashboard slow | Marts empty, fallback to aggregator. Run `stackunderflow etl backfill` to populate | One-time, then watcher keeps it fresh |
+| `pytest -m slow` failing on `/api/etl/status` | Route may not be in main yet (Wave 4C pre-merge state); skip is acceptable | `tests/stackunderflow/integration/test_route_perf_regression.py` |
+
+---
+
+## What I'd do next if I had a week
+
+1. **Wave 5: lower-grain marts.** `tool_mart` (per-tool aggregates) + `command_mart` (per-command). Unblocks the deferred `/api/cost-data` per-session/per-tool blocks and lets `optimize.py`'s remaining 6 detectors move off the aggregator path.
+2. **Real-data validation of the 12 beta normalizers.** For each, generate a synthetic but spec-accurate fixture; assert event shape matches the codeburn catalog spec; flag any drift. Most useful: catch the next "Cursor v3 conversationId-in-the-key" before it ships.
+3. **Real `/api/etl/backfill` route.** Wraps the existing CLI orchestrator in a FastAPI BackgroundTask. Wave 4F's UI button hits 404 today.
+4. **Lock file / single-watcher invariant.** Prevent two `stackunderflow start` instances from racing. `flock` on `~/.stackunderflow/server.lock`.
+5. **Streaming-safe `messages` partitioning.** `messages_YYYYMM` partitions, `litestream`-friendly. Future-proofs the store at multi-year scale.
+
+---
+
+That's the picture. File paths above are relative to the repository root at `/Users/yadkonrad/dev_dev/year26/jan26/StackUnderflow/`. Welcome.
diff --git a/pyproject.toml b/pyproject.toml
index 1412c49..4c2070c 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "stackunderflow"
-version = "0.6.1"
+version = "0.7.0"
 description = "A local-first knowledge base for your AI coding sessions"
 readme = "README.md"
 requires-python = ">=3.11"
diff --git a/stackunderflow-ui/package.json b/stackunderflow-ui/package.json
index be258d0..dde38a0 100644
--- a/stackunderflow-ui/package.json
+++ b/stackunderflow-ui/package.json
@@ -1,7 +1,7 @@
 {
   "name": "stackunderflow-ui",
   "private": true,
-  "version": "0.6.1",
+  "version": "0.7.0",
   "type": "module",
   "scripts": {
     "dev": "vite",
diff --git a/stackunderflow/__version__.py b/stackunderflow/__version__.py
index cdfdb6c..e7f47c8 100644
--- a/stackunderflow/__version__.py
+++ b/stackunderflow/__version__.py
@@ -1,3 +1,3 @@
 """Version information for stackunderflow"""
 
-__version__ = "0.6.1"
+__version__ = "0.7.0"