feat: DuckDB-only paired-histogram backend (DRC-3390 Stage B)#1399
feat: DuckDB-only paired-histogram backend (DRC-3390 Stage B)#1399danyelf wants to merge 8 commits into
Conversation
…390 Stage B)
Stage B of the DRC-3390 paired-histograms restructure (see
``/Users/danyel/.claude-recce/plans/honestly-when-i-read-nested-candle.md``).
First end-to-end runnable cut: backend serves paired-histogram payloads on
DuckDB; everything else returns ``{status: "unsupported"}``. Stage D will
restore polyglot adapter coverage; Stage C wires the UI to the contract
this PR freezes.
Backend (Python)
- ``recce/adapter/capabilities.py`` — capability table reduced to one row
(DuckDB = full tier). ``detect_capabilities()`` returns the disabled tier
for every other adapter; the strategy router short-circuits with the
unsupported envelope on those. Stage D restores the polyglot tiers.
- ``recce/adapter/approx_aggregates.py`` — ``render_approx_count_distinct``,
``render_approx_top_k``, ``render_approx_percentile`` keep only the
DuckDB branch. Non-DuckDB adapter types raise
``UnsupportedAggregateError``. Deletes the Snowflake / BigQuery /
Databricks / Spark / Athena / Trino / Presto / ClickHouse / Redshift
renderer branches (~85 LOC) — Stage D restores them.
- ``recce/tasks/profile_distribution.py`` — full orchestration from PR 2,
trimmed:
* ``pick_strategy()`` returns ``approx_all`` for DuckDB, ``unsupported``
otherwise. Deletes the ``percentile_only`` branch.
* ``parse_topk_result()`` keeps only the DuckDB flat-list branch; raises
``ValueError`` for other adapter types. Deletes the Snowflake-pairs /
BigQuery-struct / Databricks-struct / Trino-map branches (~70 LOC).
* ``render_epoch_cast()`` keeps only DuckDB's ``epoch(col)``; raises for
others. Deletes Snowflake / BigQuery / Databricks / Spark / Trino /
Athena / Presto / ClickHouse / Redshift epoch conversions (~12 LOC).
* ``_coerce_percentile_array()`` collapses to the flat-list path
(BigQuery 101-element slice deleted).
* DRC-3504 (timestamp epoch cast) and DRC-3507 (DuckDB rollback on
per-column failure) fixes preserved verbatim.
* In-memory memoization preserved verbatim.
* Telemetry preserved; docstring corrected from "Emit a PostHog ..."
to "Emit an Amplitude ..." (matches the ``log_performance`` →
Amplitude HTTP route at recce/event/collector.py:19).
- ``--inline-profile`` CLI flag + ``RecceServerFlags.inline_profile``
plumbing.
- ``RunType.PROFILE_DISTRIBUTION`` enum + ``dbt_supported_registry``
registration so the task is wired through.
Tests (Python)
- ``tests/adapter/test_capabilities.py`` — DuckDB at full tier, everything
else disabled. Drops the per-tier parametrization for Snowflake /
BigQuery / etc. (returns in Stage D).
- ``tests/adapter/test_approx_aggregates.py`` — DuckDB-positive SQL
snapshots + parametrized "every other adapter raises
``UnsupportedAggregateError``" coverage.
- ``tests/tasks/test_profile_distribution.py`` — DuckDB-targeted (continuous
/ low-card / high-card / UUID-cap-degenerate / timestamp DRC-3504
regression / deliberately-bad-column DRC-3507 regression / memoization
hit-miss / unsupported tier short-circuit / strategy router). Drops the
multi-adapter strategy-router and per-adapter top-K shape tests (back
in Stage D).
API contract (TypeScript)
- ``js/packages/ui/src/api/types/run.ts``:
* Add ``"profile_distribution"`` to ``RunType``.
* ``ProfileDistributionParams`` (``model`` + optional ``columns``).
* ``ProfileDistributionResult`` discriminated union — all four variant
cases expressed (``histogram`` / ``topk`` / ``null`` per-column failure
/ ``unsupported`` envelope) — Stage C narrows on ``kind`` and
``status``.
* ``Run`` arm + ``isProfileDistributionRun`` guard.
* Promote ``runTypeHasRef`` to a real type guard
(``runType is RunTypeWithRef``).
* Add ``RunTypeWithRef`` type.
- ``js/packages/ui/src/components/run/registry.ts``:
* ``RegisteredRunType``, ``REGISTERED_RUN_TYPES``,
``isRegisteredRunType`` guard — backend-only run types like
``profile_distribution`` are deliberately out of the checklist
registry.
* ``findByRunType`` and ``createBoundFindByRunType`` narrow to
``RegisteredRunType``.
- ``js/packages/ui/src/components/run/types.ts``: ``RunTypeRegistry``
keys on ``RegisteredRunType`` (was ``RunType``).
- ``js/packages/ui/src/api/flag.ts``: ``inline_profile`` boolean.
- Two callsites that pass arbitrary ``RunType`` get the
``isRegisteredRunType`` guard added: ``CheckDetailOss.tsx``,
``RecceActionAdapter.tsx``.
Cherry-picked test fixes (tied to RegisteredRunType, ride with this PR)
- ``js/src/components/run/__tests__/registry.test.ts`` — narrow
``RunType[]`` → ``RegisteredRunType[]``.
- ``js/src/lib/hooks/__tests__/RecceActionAdapter.test.tsx`` — stub
``isRegisteredRunType`` in the module mock so the adapter doesn't bail
before calling submitRun.
Verification
- ``black --check ./recce ./tests`` + ``isort --check-only ./recce ./tests``
clean.
- ``pytest tests/tasks/test_profile_distribution.py
tests/adapter/test_capabilities.py tests/adapter/test_approx_aggregates.py``
— 116 / 116 pass.
- Full ``pytest tests/`` (excluding 6 pre-existing SPA-route tests that
require a built ``js/out/`` bundle not present in fresh worktrees) —
1318 / 1318 pass.
- DuckDB smoke against ``main.orders`` in
``/Users/danyel/code/Recce/jaffle_shop_duckdb/jaffle_shop.duckdb``:
continuous (amount) returns ``bin_edges[12]`` + ``base_density[11]`` +
``current_density[11]`` in 5.2ms; categorical (status) returns 5 values
flat-list, counts=None; datetime (order_date) renders via
``epoch(col)`` cast.
- Snowflake smoke against ``snowflake_dev`` target (jaffle_shop_duckdb
project): returns ``{status: "unsupported", reason: "Adapter 'snowflake'
does not support APPROX_PERCENTILE.", columns: {}}`` end-to-end through
``ProfileDistributionTask.execute()``. This is the path the colleague
will exercise to verify Stage B's non-DuckDB short-circuit.
- ``pnpm --filter @datarecce/ui type:check`` introduces zero new TS errors
vs main (baseline CSS-modules / useTrackLineageRender errors preserved
per recce-python-tooling memo).
Scope deletion summary
- ~85 LOC of non-DuckDB SQL renderers removed from approx_aggregates.py
- ~70 LOC of non-DuckDB top-K post-processor branches removed from
profile_distribution.py
- ~12 LOC of non-DuckDB epoch casts removed from render_epoch_cast
- Capability table reduced from 14 rows to 1
- ~12 per-adapter renderer snapshot tests deleted
- ~5 multi-adapter strategy-router / top-K-shape tests deleted
Combined Stage B (1895 LOC across 6 files) vs PR 1 + PR 2 source
(2159 LOC across the same 6 files): 264 LOC trim, with most of the
remaining size in the unchanged orchestration / memoization / payload
builders (which Stage D will extend, not rewrite).
Closes #1389 and #1390 (in spirit — those PRs were closed as part of
the post-Ultraplan restructure; their branches stay as code-stash for
Stage D).
Refs DRC-3390.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Danyel Fisher <danyel@gmail.com>
Codecov Report❌ Patch coverage is
... and 5 files with indirect coverage changes 🚀 New features to boost your workflow:
|
…e-b-backend-duckdb Signed-off-by: Danyel Fisher <danyel@gmail.com> # Conflicts: # recce/cli.py
Shared bin edges collapsed base_density and current_density to identical arrays — averaging the two envs' quantiles onto one grid threw away the divergence the paired histogram exists to show, and broke the equal-area property the renderer (PR #1398) depends on. With only approx percentiles from the warehouse (no bin counts), the honest rendering is each env on its OWN quantile edges: density = (1/NUM_BINS)/span. Base and current edges deliberately do not line up; Stage C overlays the two staircases on a shared value axis. - _build_histogram_payload now emits base_bin_edges + current_bin_edges via new _env_edges_and_density helper (fraction-delta / width). - API contract: ProfileDistributionHistogramPayload carries per-env edges. - Drop now-unused _safe_min/_safe_max. - Tests: assert base != current for divergent inputs, equal-area per bin, empty-without-bounds, and the parked point-mass (tie) behaviour. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>
… miss The earlier cut introduced RegisteredRunType + REGISTERED_RUN_TYPES + an isRegisteredRunType guard solely because findByRunType was typed to always return an entry — so a run type without a registry entry (the backend-only profile_distribution) had to be ruled out before every call. That created a second run-type list that could drift from the registry. Collapse it: a registry lookup can simply miss. findByRunType is now an overload — literal registered keys keep their precise entry type; a dynamic RunType returns RegistryEntry | undefined. The two dynamic callsites (CheckDetailOss, RecceActionAdapter) null-check the result instead of guarding. Removes RegisteredRunType, REGISTERED_RUN_TYPES, isRegisteredRunType and their exports. profile_distribution stays a wire-level RunType with no registry entry. One run-type list again; the registry is the single source of truth for what's registered. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>
Added speculatively for Stage C but has zero consumers (every sibling guard is used). Stage C can reintroduce it when it actually narrows on the type. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>
Code Review: PR #1399SHA BlockersNone. Issues
Notes
What I verified as sound: the |
…unify no-counts encoding Address PR #1399 review (even-wei): - Issue 1: the empty-columns early return omitted base_total/current_total, which the frozen TS contract (ProfileDistributionOkResult) declares as required. Emit 0/0 (probe hasn't run) so Stage C never reads undefined. Add a regression test for the columns-filter-matches-nothing branch. - Note 2: cap-degenerate columns now emit base_counts/current_counts as null (was []), matching the normal categorical no-counts encoding. One 'no counts' shape per adapter; empty values flags degeneracy. - Note 1: document the datetime epoch-seconds edge encoding in the TS histogram payload contract so Stage C formats edges via out-of-band type. Note 3 (get_columns error taxonomy) deferred to Stage D per review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>
|
Thanks for the early review @even-wei — addressed in Stepping back, before I let the 🤖 say its bit -- looks like all the stuff was edge cases (columns with no items, empty columns). Indeed, I'm pretty sure that our main Histogram code crashes on datetime columns (or returns nothing); datetimes are annoying and need cleanup. Thanks for catching them! -- Summary: Issue 1 ( Note 2 ( Note 1 ( Note 3 ( All |
…DRC-3390 Stage B) DuckDB's approx_top_k returns values with no counts and no per-env membership, so the categorical paired-histogram cell (which needs real counts OR real per-env ranks) had nothing to render. The cells were designed for a ranks payload (base_ranks/current_ranks/k) that lived only in the Storybook fixtures and never made it into the wire contract — the A<->B drift surfaced at Stage C. Promote the ranks variant into run.ts as the single source of truth: - run.ts: add ProfileDistributionTopKRanksPayload (kind "topk", mode "ranks", base_ranks/current_ranks/k); counts payload gains an explicit mode "counts" discriminant and its counts arrays are no longer wholly nullable. Re-export the ProfileDistribution* types from the api barrel. - profile_distribution.py: _build_topk_payload dispatches to a new _build_topk_ranks_payload whenever both envs lack counts (the DuckDB hot path); cap-degenerate columns emit the same ranks shape with empty arrays. Nothing emits wholly-null counts anymore. - tests: assert the ranks shape on the low-cardinality + cap-degenerate paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Danyel Fisher <danyel@gmail.com>
Code Review: PR #1399SHA Verified locally: Python BlockersNone. Issues
Notes
|
even-wei
left a comment
There was a problem hiding this comment.
Claude Code Review: NO-GO. 1 issue (top-K base-vs-current divergence dropped on truncation; untested) + 3 notes. Full review in the PR comment above.
Stage B — DuckDB-only paired-distribution backend + API contract
First runnable cut of the DRC-3390 paired-histograms work: the backend computes paired column distributions on DuckDB and returns a clean
{status: "unsupported"}envelope for every other adapter. No UI yet — this PR is the backend plus the TypeScript API contract the rest of the stack builds on.Stages: A = Storybook cells (#1398) · B = this PR · C = wire the UI to this contract · D = polyglot adapter coverage. Linear: DRC-3390.
What's in it
1.
--inline-profileserver flagNew flag (
RECCE_INLINE_PROFILE), plumbed throughRecceServerFlags.inline_profile. It gates the whole feature; off by default, so nothing changes for existing users.2. Adapter capability layer + the profile task
Two new modules decide whether and how an adapter can compute approximate aggregates:
capabilities.py— a one-row table:duckdb→ full tier, every other adapter → disabled. The strategy router short-circuits non-DuckDB to theunsupportedenvelope instead of emitting SQL the engine can't parse.approx_aggregates.py— DuckDB renderers forapprox_count_distinct/approx_quantile/approx_top_k; non-DuckDB raisesUnsupportedAggregateError.tasks/profile_distribution.pyorchestrates the pipeline (HLL probe → percentile histograms → top-K), registered asRunType.PROFILE_DISTRIBUTION. Stage D adds the other adapters back to the table.Per-env histogram edges (the key design call). Each env's histogram is built from its own quantiles —
base_bin_edges+current_bin_edges, paired withbase_density/current_density(density = Δfraction / span). The two edge arrays deliberately don't line up; Stage C overlays them on a shared axis. We only get approx percentiles back from the warehouse (not bin counts), so per-env edges are the only assumption-free rendering — and they're what lets base and current actually diverge. This snake_case wire shape maps to the camelCasebaseBinEdges/currentBinEdgesthat #1398 consumes.3. Run-type registry
profile_distributionis dispatchable but has no registry entry (no checklist title / icon / result view — its results feed schema-view cells).findByRunTypenow returnsRegistryEntry | undefined, and the two dynamic callsites null-check. The registry stays the single source of truth for what's registered — no parallel run-type list.Verifying it
Automated (runs in CI):
pytest tests/tasks/test_profile_distribution.py \ tests/adapter/test_capabilities.py \ tests/adapter/test_approx_aggregates.py # 121 passingLive — there's no UI yet, so drive the run API directly against a DuckDB project:
Run against
jaffle_shop_duckdbwith a real base→current change (base = HEAD; current = a working-tree edit that dropsreturnedorders and renamesstatus), the per-env divergence comes through:{ "status": "ok", "strategy": "approx_all", "base_total": 999, "current_total": 903, // returned orders dropped by current "columns": { "amount": { "kind": "histogram", "base_bin_edges": [1, 7, 12.6, /* … */ 97], "current_bin_edges": [100, 703, 1276, /* … */ 9704], // own quantiles per env — don't line up "base_density": [/* 11 */], "current_density": [/* 11, differs from base */] }, "status": { "kind": "topk", // union of both envs' values "values": ["completed", "returned", /* … */ "ORDER-completed", "ORDER-placed", /* … */], "base_counts": null, "current_counts": null, "trimmed": false } } }Point the same project at a non-DuckDB target and the identical curl returns the envelope, not a crash:
{ "status": "unsupported", "reason": "Adapter 'snowflake' does not support APPROX_PERCENTILE.", "columns": {} }Out of scope / follow-ups
base_total/current_totalas a column-wide denominator for proportions. DuckDB'sapprox_top_kreturns no counts, so Stage B always renders in ranks mode, which doesn't use totals. The field only becomes necessary once count-returning adapters land in Stage D. (Raised in the feat: paired histogram cells in Storybook (DRC-3390 Stage A) #1398 review; recorded here so it isn't lost.)0(no spike yet) — documented, follow-up later