linq_fold group_by: min/max/first reducers + multi-reducer fused pass #2724

Merged: borisbat merged 2 commits into master from bbatkin/linq-fold-min-max-first-multi-reducer on May 19, 2026
@borisbat
Collaborator

Summary

Closes the remaining BufferGroupBy gaps from PR #2723's deferred-follow-up list. Builds on PR-A1's miss/hit emission split.

A.3 — per-reducer dispatch. Recognizer extended to 8 reducer shapes: bare {sum, min, max, first} + inner-select {sum, min, max, first} (<reducer>(select(<bind>._1, <lambda>))). New emit_reducer_branches helper produces per-reducer (missInit, hitUpdate) pairs.

  • min / max: direct < / > for workhorse acc types; _::less fallback for non-workhorse (matches the reference at linq.das:1224,1308).
  • first: miss-init only, no hit update. Subsequent same-key elements are ignored, exploiting the first-key-wins guarantee from PR #2721 (linq_fold buffer-required splice arms: reverse, distinct, group_by).
  • inner-select min / max: bind the projection result to a per-element temp so the inner body evaluates exactly once per source element — matches select laziness and avoids re-running side effects.
  • first_or_default(def) takes 2 arguments, so it is rejected by the arity check and cascades to tier 2.
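
The per-reducer dispatch described above can be modelled in a few lines of Python. This is only an illustrative sketch, not the daslang macro code: `REDUCER_BRANCHES`, `group_reduce`, and the `project` parameter are hypothetical names, and the table of lambdas stands in for the (missInit, hitUpdate) pairs that `emit_reducer_branches` is said to produce. It assumes the projection is evaluated once per element that actually needs it, bound to a temporary before the comparison.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical (miss_init, hit_update) pairs, one per reducer.
# hit_update is None when the reducer never touches an existing entry.
MissInit = Callable[[Any], Any]
HitUpdate = Optional[Callable[[Any, Any], Any]]

REDUCER_BRANCHES: Dict[str, Tuple[MissInit, HitUpdate]] = {
    "sum":   (lambda v: v, lambda acc, v: acc + v),
    "min":   (lambda v: v, lambda acc, v: v if v < acc else acc),
    "max":   (lambda v: v, lambda acc, v: v if v > acc else acc),
    "first": (lambda v: v, None),  # miss-init only: first key wins
}


def group_reduce(rows, key, reducer, project=lambda row: row):
    """Single pass over rows; one hash lookup plus one slot mutation per row."""
    miss_init, hit_update = REDUCER_BRANCHES[reducer]
    table = {}
    for row in rows:
        k = key(row)
        if k not in table:
            # Miss: seed the accumulator from this element.
            table[k] = miss_init(project(row))
        elif hit_update is not None:
            # Hit: bind the projection to a temp so it runs exactly once,
            # then compare/accumulate against the stored value.
            value = project(row)
            table[k] = hit_update(table[k], value)
        # Hit with no hit_update ("first"): the element is ignored and the
        # projection is never evaluated for it.
    return table


rows = [("a", 3), ("b", 5), ("a", 1), ("b", 9)]
print(group_reduce(rows, lambda r: r[0], "min", lambda r: r[1]))    # {'a': 1, 'b': 5}
print(group_reduce(rows, lambda r: r[0], "first", lambda r: r[1]))  # {'a': 3, 'b': 5}
```

The `first` entry carries no hit update at all, which is what lets the loop skip later same-key elements without even evaluating their projection.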

A.4 — N+1-slot named tuples. The 2-slot recognizer (key + 1 reducer) is replaced by recognize_reducer_specs returning array<ReducerSpec>. Named-tuple form now accepts arbitrarily many reducer slots (key at _0, reducers at _1.._N). The planner walks the spec array once and concatenates per-slot missInit + hitUpdate statements into the per-element loop — N reducers fused into one pass.

Field access into dynamic slots (entry._{slot}) is built programmatically via mk_slot_ref since qmacro has no dynamic-field-name splice. The per-element loop drops the else branch entirely for first-only chains (no hit update needed).
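
As a rough illustration of the fused pass, the Python sketch below walks a list of reducer specs once per element and updates every slot of the matching entry. Everything here is an assumption made for illustration: `fused_group_by`, the `(miss_init, hit_update, project)` spec tuples, and the list-based entry layout are hypothetical, and indexing `entry[slot]` merely stands in for the programmatically built `entry._{slot}` access. The reducer mix (count, sum, max) mirrors the one the multi-reducer benchmark is described as using.

```python
def fused_group_by(rows, key, specs):
    """specs: list of (miss_init, hit_update, project); hit_update may be None.

    Returns one tuple per key: key at slot 0, reducers at slots 1..N.
    """
    # Decided once, outside the loop: if no reducer has a hit update
    # (e.g. a first-only chain), the hit branch is dropped entirely.
    any_hit_update = any(hit is not None for _, hit, _ in specs)

    table = {}
    for row in rows:
        k = key(row)
        entry = table.get(k)
        if entry is None:
            # Miss: initialise every reducer slot from this element.
            table[k] = [k] + [miss(project(row)) for miss, _, project in specs]
        elif any_hit_update:
            # Hit: update each slot that has a hit update, all in the same pass.
            for slot, (_, hit, project) in enumerate(specs, start=1):
                if hit is not None:
                    entry[slot] = hit(entry[slot], project(row))
    # dict preserves insertion order, so groups come out in first-seen key order.
    return [tuple(entry) for entry in table.values()]


amount = lambda r: r[1]
count_sum_max = [
    (lambda v: 1, lambda acc, v: acc + 1, amount),            # count
    (lambda v: v, lambda acc, v: acc + v, amount),            # sum
    (lambda v: v, lambda acc, v: v if v > acc else acc, amount),  # max
]
rows = [("a", 3), ("b", 5), ("a", 1), ("b", 9), ("a", 7)]
print(fused_group_by(rows, lambda r: r[0], count_sum_max))
# [('a', 3, 11, 7), ('b', 2, 14, 9)]
```

The point of the design is that the hash lookup happens once per element no matter how many reducer slots follow it; each extra reducer only adds one more slot mutation.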

Bail paths (cascade to tier 2): key not at slot 0; unrecognized reducer in any slot (e.g. average); first_or_default (fails arity).
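
The bail paths can be pictured as a recognizer that returns nothing as soon as any slot fails to match, leaving the caller to fall back to the slower tier. The sketch below is a simplified Python model under stated assumptions: the `(kind, name, argc)` slot encoding, the `FUSABLE` set, and the function name are all hypothetical, and the real recognizer works on AST nodes rather than tuples. Only the four reducers named in this PR are listed; shapes handled by earlier PRs are left out.

```python
from typing import List, Optional, Tuple

# Hypothetical slot encoding: ("key", name, 0) or ("call", reducer_name, argc).
Slot = Tuple[str, str, int]

FUSABLE = {"sum", "min", "max", "first"}


def recognize_reducer_specs_sketch(slots: List[Slot]) -> Optional[List[str]]:
    """Return reducer names for slots 1..N, or None to cascade to tier 2."""
    # Bail: the key must sit at slot 0 of the named tuple.
    if not slots or slots[0][0] != "key":
        return None
    specs = []
    for kind, name, argc in slots[1:]:
        # Bail: wrong arity (e.g. first_or_default(def) carries 2 arguments).
        if kind != "call" or argc != 1:
            return None
        # Bail: reducer not recognized (e.g. average).
        if name not in FUSABLE:
            return None
        specs.append(name)
    return specs


print(recognize_reducer_specs_sketch(
    [("key", "k", 0), ("call", "min", 1), ("call", "first", 1)]))       # ['min', 'first']
print(recognize_reducer_specs_sketch(
    [("key", "k", 0), ("call", "average", 1)]))                         # None
```

A single unrecognized slot rejects the whole chain, which matches the all-or-nothing cascade listed above.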

Headline (100K rows, INTERP)

| Benchmark | m1 SQL (ns/op) | m3 LINQ (ns/op) | m3f splice (ns/op) | Win |
|---|---|---|---|---|
| groupby_min | 175 | 111 | 42 | 2.6× over m3 / 4.2× over SQL |
| groupby_max | 173 | 108 | 43 | 2.5× over m3 / 4.0× over SQL |
| groupby_first | n/a | 71 | 36 | 2.0× over m3 (no direct SQL aggregator) |
| groupby_multi_reducer | 189 | 139 | 53 | 2.6× over m3 / 3.6× over SQL (3 reducers fused) |
| groupby_sum (regression check) | 175 | 102 | 36 | parity (refactor preserves PR-A1 splice) |
| groupby_count (regression check) | 141 | 71 | 36 | parity (refactor preserves PR-A1 splice) |

All 6 splice variants land in ~36–53 ns/op — per-element work bounded by the single hash op + slot mutations regardless of which reducer or how many. Multi-reducer pays ~5 ns per extra slot.
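
The "Win" figures for the four new benchmarks follow directly from the ns/op columns. A small Python check, using only the numbers reported above (the `speedups` helper is a hypothetical name introduced for this sketch), reproduces them by dividing each baseline by the splice time and rounding to one decimal:

```python
# (m1 SQL, m3 LINQ, m3f splice) in ns/op, as reported in the headline numbers.
# None marks the missing SQL baseline for groupby_first.
benchmarks = {
    "groupby_min": (175, 111, 42),
    "groupby_max": (173, 108, 43),
    "groupby_first": (None, 71, 36),
    "groupby_multi_reducer": (189, 139, 53),
}


def speedups(sql, linq, splice):
    """Return (speedup over m3 LINQ, speedup over m1 SQL or None)."""
    over_linq = round(linq / splice, 1)
    over_sql = round(sql / splice, 1) if sql is not None else None
    return over_linq, over_sql


for name, (sql, linq, splice) in benchmarks.items():
    over_linq, over_sql = speedups(sql, linq, splice)
    print(f"{name}: {over_linq}x over m3, "
          f"{'n/a' if over_sql is None else f'{over_sql}x'} over SQL")
# groupby_min: 2.6x over m3, 4.2x over SQL
# groupby_max: 2.5x over m3, 4.0x over SQL
# groupby_first: 2.0x over m3, n/a over SQL
# groupby_multi_reducer: 2.6x over m3, 3.6x over SQL
```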

Test plan

  • tests/linq/test_linq_fold.das — 257 tests pass (18 new in test_group_by_min_max_first_fold_parity + test_group_by_multi_reducer_fold_parity covering bare/named/inner-select/multi/empty/Person-fixture-parity)
  • tests/linq/test_linq_fold_ast.das — 102 tests pass (5 new: G5a min workhorse direct compare, G5b min non-workhorse _::less, G6 first no hit compare, G7 multi-reducer fused pass, G10 first_or_default cascade)
  • tests/linq/test_linq_group_by.das — 18 tests pass (regression check)
  • tests/linq/test_linq_aggregation.das — 46 tests pass
  • tests/linq/test_linq.das — 21 tests pass
  • AOT mode (test_aot -use-aot) — 257 + 102 tests pass
  • Lint clean (mcp__daslang__lint)
  • Format clean (mcp__daslang__format_file)
  • All 4 new benchmarks plus PR-A1 regression checks run cleanly

Deferred follow-ups

  • average reducer (2-slot per-key acc + post-process division)
  • Upstream where_* / select* fusion into group_by (PR-B)
  • reverse_take backward index loop on array sources (PR-C)

🤖 Generated with Claude Code

borisbat and others added 2 commits May 18, 2026 21:52
Generalizes plan_group_by along two axes:

A.3 — per-reducer dispatch. `is_bucket_reducer_call` extended to 8 reducer
shapes: bare {sum, min, max, first} and inner-select {sum, min, max, first}
(`<reducer>(select(<bind>._1, <lambda>))`). New `emit_reducer_branches`
helper produces per-reducer (missInit, hitUpdate) pairs:

- min/max: direct `<` / `>` on workhorse acc types; `_::less` fallback for
  non-workhorse (matches the reference at linq.das:1224,1308).
- first: miss-init only (hitUpdate null). Subsequent same-key elements are
  ignored, exploiting the first-key-wins guarantee from PR #2721.
- inner-select min/max: bind the projection result to a per-element temp so
  the inner body evaluates exactly once per source element — matches the
  reference's `select` laziness, and avoids re-running side effects.
- first_or_default is rejected at the arity check (2 args) and cascades.

A.4 — N+1-slot named tuples. The 2-slot recognizer (key + 1 reducer) is
replaced by `recognize_reducer_specs` returning `array<ReducerSpec>`. The
named-tuple form now accepts arbitrarily many reducer slots (key at _0,
reducers at _1.._N). The planner walks the spec array once and concatenates
per-slot missInit + hitUpdate statements into the per-element loop —
N reducers fused into one pass instead of N separate passes.

Field-access into dynamic slots (`entry._{slot}`) is built programmatically
via `mk_slot_ref` since qmacro has no dynamic-field-name splice. The
loop-body emission is now conditional on whether any reducer contributes a
hit update — first-only chains drop the else branch entirely.

Bail paths (all cascade to tier 2):
- key not at slot 0 of the named tuple
- unrecognized reducer in any slot (e.g. `average`)
- first_or_default (2-arg, fails arity)

Tests:
- 18 new parity tests in test_group_by_min_max_first_fold_parity and
  test_group_by_multi_reducer_fold_parity (bare, named, inner-select,
  multi-reducer, empty source, reference parity)
- 5 new AST-shape tests: G5a (min workhorse direct compare), G5b (min
  non-workhorse `_::less`), G6 (first no hit compare), G7 (multi-reducer
  fused pass), G10 (first_or_default cascade)

Both PR-A1 baselines (groupby_count, groupby_sum) hold parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t + multi-reducer

4 new 100K-row benchmarks (m1 SQL / m3 plain LINQ / m3f splice):
- groupby_min: 175 / 111 / 42 ns/op  (2.6× over m3, 4.2× over SQL)
- groupby_max: 173 / 108 / 43 ns/op  (2.5× over m3, 4.0× over SQL)
- groupby_first: — / 71 / 36 ns/op   (2.0× over m3; no direct SQL aggregator)
- groupby_multi_reducer: 189 / 139 / 53 ns/op  (3 reducers fused into 1 pass;
  2.6× over m3, 3.6× over SQL)

All 6 splice variants land within ~36–53 ns/op — per-element work bounded by
the single hash op + slot mutations regardless of which reducer or how many.
Multi-reducer pays ~5 ns per extra slot, still beats SQL by 3.6×.

LINQ.md: refreshed Phase status table, baseline rows, Phase 3+ subsection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 19, 2026 04:52
Contributor

Copilot AI left a comment

Pull request overview

This PR expands the _fold tier-1 splice for group_by[_lazy] |> select(...) in daslib/linq_fold.das to cover additional reducer shapes (min, max, first, and their inner-select forms) and to fuse multiple reducers from N-slot named-tuples into a single per-element table-update pass, closing remaining BufferGroupBy gaps from the earlier follow-ups.

Changes:

  • Extend reducer recognition to 8 shapes: bare {sum,min,max,first} + inner-select {sum,min,max,first} and add per-reducer miss-init / hit-update emission.
  • Generalize named-tuple group projection handling from 2 slots to N+1 slots (key at _0, reducers at _1.._N) by planning via an array<ReducerSpec> and concatenating per-slot update statements into one fused loop.
  • Add functional parity tests, AST-shape tests, new benchmarks (groupby_min/max/first/multi_reducer), and update benchmarks/sql/LINQ.md coverage notes.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| daslib/linq_fold.das | Implements reducer-spec planning, new reducer recognizers, and fused multi-reducer group-by splice emission. |
| tests/linq/test_linq_fold.das | Adds runtime parity tests for min/max/first (bare/named/inner-select) and multi-reducer named tuples. |
| tests/linq/test_linq_fold_ast.das | Adds AST fingerprint tests ensuring splice selection and expected codegen characteristics (direct compare vs _::less, no runtime reducers, fused pass). |
| benchmarks/sql/LINQ.md | Updates phase coverage table and benchmark results/notes to include PR-A2 additions. |
| benchmarks/sql/groupby_min.das | Adds benchmark for inner-select-min group-by splice performance vs baselines. |
| benchmarks/sql/groupby_max.das | Adds benchmark for inner-select-max group-by splice performance vs baselines. |
| benchmarks/sql/groupby_first.das | Adds benchmark for bare-first group-by splice performance (no SQL baseline). |
| benchmarks/sql/groupby_multi_reducer.das | Adds benchmark demonstrating multi-reducer fused pass (count+sum+max) performance. |

@borisbat borisbat merged commit dd85a67 into master May 19, 2026
32 checks passed