Skip to content

linq_fold: peel key blocks (group_by + distinct_by) — 16-28 ns/op wins#2824

Merged
borisbat merged 3 commits into
masterfrom
bbatkin/linq-fold-peel-key-blocks
May 23, 2026
Merged

linq_fold: peel key blocks (group_by + distinct_by) — 16-28 ns/op wins#2824
borisbat merged 3 commits into
masterfrom
bbatkin/linq-fold-peel-key-blocks

Conversation

@borisbat
Copy link
Copy Markdown
Collaborator

Summary

Bench-driven follow-up after PR #2822 closed the Wave 5 terminator splice ladder. A sweep over benchmarks/sql flagged group_by lanes as consistently 1.2–1.6× slower on the decs splice (m4) than the array splice (m3f). Root cause: plan_group_by_core and plan_distinct's distinct_by arm emitted invoke($keyBlock, sourceVar) per element. Switching to peel_lambda_rename_var / peel_lambda_replace_var inlines _.field as a direct field read, with the existing fallback to invoke(...) for non-peelable blocks.

The win is twofold:

  • Per-element invoke dispatch eliminated (~3-5 ns/elem in INTERP on both lanes).
  • On the decs side, the whole-decs_tup lambda argument no longer pins the pruner — build_decs_inner_for_pruned now drops the components only the keyBlock referenced through the tuple shape (e.g. tuple<id:int;name:string;price:int;brand:int;year:int;dealer_id:int> shrinks to tuple<brand:int>), removing per-iter string copies and matching get_ro slots.

Bench (INTERP, 100K rows, ns/op, before → after)

bench m3f before m3f after m4 before m4 after
groupby_count 36.3 19.5 (-46%) 49.9 22.5 (-55%)
groupby_sum 36.0 18.6 50.1 24.4
groupby_min 42.9 25.5 57.1 29.3
groupby_max 42.8 24.9 56.5 29.6
groupby_average 52.1 30.4 62.7 34.7
groupby_having_count 36.4 19.3 50.0 22.2
groupby_having_hidden_sum 39.9 24.2 57.3 30.1
groupby_multi_reducer 53.3 32.3 63.6 36.6
groupby_select_sum 58.6 37.2 61.4 42.4
groupby_first 35.3 18.5 47.3 29.9
groupby_where_count 23.7 14.7 37.0 19.8
groupby_where_sum 23.3 14.1 36.4 19.7
distinct_by_count (NEW) 33.7 15.3 (-55%) 44.3 19.7 (-56%)

distinct_by_count is the new bench that covers the path the existing distinct_count doesn't touch (it preselects to a scalar first). All other non-groupby/distinct_by lanes unchanged within bench noise.

Bundled

The branch carries two adjacent improvements that surfaced while shipping this:

  1. dastest bench format with 1-decimal ns/op (commit 1). The native column truncated to int64 ns/op; go_export and the JSON message line already used :.1f. Switched native to the same shape, keeping column alignment + zero-allocs green/yellow coloring.

  2. dastest lint enablement (commit 3). dastest/dastest_clargs.das was missing options gen2, which made it parse fine via direct daslang invocation (defaults to v2) but fail under the lint pipeline's compile_file path (defaults to v1) — @clarg_doc = ... is v2-only syntax. That parse error rippled to every lint of dastest/suite.das and blocked pre-push for any dastest touch. Added options gen2, then addressed the ~30 pre-existing PERF/STYLE/LINT warnings it surfaced in dastest/suite.das + 2 STYLE005 in dastest/log.das (the latter via suite's require chain). No behavior change.

Splice-shape test update

test_wave4_pruned_groupby_count_splice_shape previously documented "pruning DISABLED — bucket needs full tuple" as a known limitation; that limitation is now fixed for the count-by-group shape, so the gate asserts 1 get_ro (brand) + 0 for price/year. The parity test (test_wave4_pruned_groupby_count_parity) is unchanged and still green.

Test plan

  • mcp__daslang__lint clean on daslib/linq_fold.das, benchmarks/sql/distinct_by_count.das, tests/linq/test_linq_from_decs.das, and all touched dastest files
  • tests/linq 1382/1382 INTERP + 1382/1382 AOT
  • tests/decs 245/245 INTERP
  • tests/dasSQLITE 782/782 INTERP
  • test_aot full lane 2409/2409
  • Bench sweep over benchmarks/sql (199 benchmarks) — diff shows gains on 12 groupby + new distinct_by; all others unchanged
  • dastest --bench --bench_format native formatter visual check (3 sample benches)
  • CI 9-lane matrix
  • Copilot review

🤖 Generated with Claude Code

borisbat and others added 3 commits May 22, 2026 20:12
Previously the native column truncated to int64 ns/op while go_export and the
JSON message line already used `:.1f`. The truncation hid sub-ns deltas that
matter at the bench-tuning level (and that the JSON sample collection captures
anyway). Switch the native format to the same `:.1f` shape, keeping the
existing column-width alignment + zero-allocs green/yellow coloring.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
plan_group_by_core and plan_distinct's distinct_by arm both emitted
`invoke($keyBlock, sourceVar)` per element. Switching to peel_lambda_rename_var
/ peel_lambda_replace_var inlines `_.field` as a direct field read, with the
existing fallback to invoke for non-peelable blocks. The win is twofold:

  • Per-element invoke dispatch eliminated (~3-5 ns/elem in INTERP on both the
    array and decs lanes).
  • On the decs side, the whole-`decs_tup` lambda argument no longer pins the
    pruner — build_decs_inner_for_pruned now drops the components only the
    keyBlock referenced through the tuple shape (e.g. a 6-field tuple<id:int;
    name:string; price:int; brand:int; year:int; dealer_id:int> shrinks to
    tuple<brand:int>), removing per-iter string copies and matching get_ro
    slots.

Bench results (INTERP, 100K rows, ns/op, before -> after):

  groupby_count             m3f 36.3 -> 19.5 (-46%)  m4 49.9 -> 22.5 (-55%)
  groupby_sum               m3f 36.0 -> 18.6         m4 50.1 -> 24.4
  groupby_min               m3f 42.9 -> 25.5         m4 57.1 -> 29.3
  groupby_max               m3f 42.8 -> 24.9         m4 56.5 -> 29.6
  groupby_average           m3f 52.1 -> 30.4         m4 62.7 -> 34.7
  groupby_having_count      m3f 36.4 -> 19.3         m4 50.0 -> 22.2
  groupby_having_hidden_sum m3f 39.9 -> 24.2         m4 57.3 -> 30.1
  groupby_multi_reducer     m3f 53.3 -> 32.3         m4 63.6 -> 36.6
  groupby_select_sum        m3f 58.6 -> 37.2         m4 61.4 -> 42.4
  groupby_first             m3f 35.3 -> 18.5         m4 47.3 -> 29.9
  groupby_where_count       m3f 23.7 -> 14.7         m4 37.0 -> 19.8
  groupby_where_sum         m3f 23.3 -> 14.1         m4 36.4 -> 19.7
  distinct_by_count (NEW)   m3f 33.7 -> 15.3 (-55%)  m4 44.3 -> 19.7 (-56%)

distinct_by_count is the new bench that exercises the distinct_by path the
existing distinct_count doesn't touch (it preselects to a scalar). All other
non-groupby/distinct_by lanes unchanged within bench noise.

test_wave4_pruned_groupby_count_splice_shape updated: the previous test
documented "pruning DISABLED — bucket needs full tuple" as a known limitation;
that limitation is now fixed for the count-by-group shape, so the shape gate
asserts 1 get_ro (brand) + 0 for price/year.

1382/1382 tests/linq + 245/245 tests/decs + 782/782 tests/dasSQLITE green
interp; 2409/2409 AOT pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…aced warnings

dastest/dastest_clargs.das was missing `options gen2`, which made it parse
fine via direct daslang invocation (defaults to v2) but fail under the lint
pipeline's compile_file path (defaults to v1) — field annotations
(`@clarg_doc = ...`) are v2-only syntax. That parse error rippled to every
lint of dastest/suite.das and blocked pre-push for any dastest touch.

Adding `options gen2` to dastest_clargs.das unblocks lint, which then surfaces
~30 pre-existing PERF/STYLE/LINT warnings in dastest/suite.das and 2 STYLE005
in dastest/log.das (the latter pulled in via suite's require chain). Per the
standing rule that touched files must lint clean, all of them are addressed:

  • PERF013 ×8 — `+= 1` → `++` on res.{total,errors,passed,failed,skipped}
  • PERF017 ×1 — `length(allowed) == 0` → `empty(allowed)`
  • PERF006 ×1 — `reserve(length(testMessages))` before push loop
  • STYLE005 ×10 — drop braces around single-stmt early exits (suite + log)
  • STYLE025 ×1 — single-stmt unsafe block with `// nolint:STYLE025` and
    why (both the deref and the pipe-callee are unsafe — expression form
    can't wrap a pipe)
  • LINT002 ×2 — unused loop vars renamed to `_errC` / `_i`
  • LINT003 ×4 — `var` → `let` on locals that are never reassigned

No behavior change. dastest --bench --run still produces the same output;
all linq + decs + dasSQLITE + AOT lanes green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 23, 2026 03:39
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves linq_fold’s group-by and distinct-by planning by peeling simple key lambdas so _.field becomes a direct field read, eliminating per-element invoke(...) overhead and enabling better decs tuple-slot pruning.

Changes:

  • Update plan_group_by_core and the distinct_by arms of plan_distinct / plan_decs_distinct to use peel_lambda_rename_var / peel_lambda_replace_var instead of per-element invoke(...).
  • Update the Wave4 pruning splice-shape test to assert the newly-enabled pruning behavior for count-by-group.
  • Add a new SQL benchmark (distinct_by_count) and tweak dastest benchmark native formatting to print 1-decimal ns/op; enable gen2 parsing for dastest_clargs.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/linq/test_linq_from_decs.das Updates splice-shape assertion to reflect newly-working pruning for group_by count-by-group.
dastest/suite.das Refactors for lint/style, adds reserve for message cloning, and prints native ns/op with 1 decimal.
dastest/log.das Minor early-return style cleanups.
dastest/dastest_clargs.das Enables options gen2 so gen2-only clargs annotations parse under lint/compile_file.
daslib/linq_fold.das Uses peeled key expressions for group_by and distinct_by to avoid per-element invoke(...).
benchmarks/sql/distinct_by_count.das Adds a benchmark that exercises the `distinct_by

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@borisbat borisbat merged commit b596317 into master May 23, 2026
30 checks passed
borisbat added a commit that referenced this pull request May 23, 2026
Two slices closing the final m4-vs-m3f gaps from #2824's residual outliers.

Slice 1 — trivial-let elision (closes sum_aggregate_m4 1.3ns gap):
When `_select(_.userName)` peels to a single `decs_tup.<field>` reference,
rename the chain bind directly to the iter var instead of synthesizing
`decs_sel_N`. wrap_decs_chain skips emitting the `let decs_sel_N = car_price`
binding entirely; the action's `acc += <iterVar>` references the iter var
natively. Required extending DecsTupUsageScanner with an iter-var-→-user-name
reverse map so bare iter-var refs still seed the pruner (previously: empty
usedNames fell through to unpruned-default, defeating the elision).

Slice 2 — reverse_take skip-into-tail (closes reverse_take_m4 5.2× gap):
For `from_decs(...).reverse().take(N).to_array()` with no where/select,
emit a two-pass invoke: pass 1 sums `arch.size` (no entity load), pass 2
uses for_each_archetype_find to skip whole archetypes whose cumulative
size still fits below the skip threshold, then a per-iter skip-counter
through the partial archetype, push into a takeN-sized buffer, and `return
true` to stop iteration once buf is full. reverse_inplace runs on the
small N buffer at end, not the full source. where/select fall through to
the legacy buffer+reverse_inplace+resize emit unchanged.

Bench (INTERP, 100K rows, ns/op):
- sum_aggregate_m4    3.4 → 2.1   matches m3f (was the systemic 1.3ns gap)
- reverse_take_m4    48.0 → 9.2   5.2× win, allocs 42B → 1B
- select_where_sum_m4 7.5 → 7.5   matches m3f (elision benefits this too)
- contains_match_m4   2.1 → 1.4   beats m3f at 2.2
- chained_where_m4    6.6 → 6.6   no regression
- count_aggregate_m4  4.1 → 4.1   no regression

Tests:
- New splice-shape assertions: trivial-let elision (no decs_sel binding
  for `_select(_.val).sum()` and `_where(_)._select(_.val).sum()`)
- New splice-shape for skip-into-tail (for_each_archetype_find count==1,
  decs_skips local presence)
- New parity tests: multi-archetype reverse+take, take(N>total),
  empty source — covers the whole-archetype-skip + partial-archetype +
  early-return arms

1388/1388 linq + 245/245 decs + 782/782 dasSQLITE green INTERP. MCP + CI
lint clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@borisbat borisbat deleted the bbatkin/linq-fold-peel-key-blocks branch May 30, 2026 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants