linq_fold: peel key blocks (group_by + distinct_by) — 16-28 ns/op wins by borisbat · Pull Request #2824 · GaijinEntertainment/daScript

borisbat · 2026-05-23T03:39:17Z

Summary

Bench-driven follow-up after PR #2822 closed the Wave 5 terminator splice ladder. A sweep over benchmarks/sql flagged group_by lanes as consistently 1.2–1.6× slower on the decs splice (m4) than the array splice (m3f). Root cause: plan_group_by_core and plan_distinct's distinct_by arm emitted invoke($keyBlock, sourceVar) per element. Switching to peel_lambda_rename_var / peel_lambda_replace_var inlines _.field as a direct field read, with the existing fallback to invoke(...) for non-peelable blocks.

The win is twofold:

Per-element invoke dispatch eliminated (~3-5 ns/elem in INTERP on both lanes).
On the decs side, the whole-decs_tup lambda argument no longer pins the pruner — build_decs_inner_for_pruned now drops the components only the keyBlock referenced through the tuple shape (e.g. tuple<id:int;name:string;price:int;brand:int;year:int;dealer_id:int> shrinks to tuple<brand:int>), removing per-iter string copies and matching get_ro slots.

Bench (INTERP, 100K rows, ns/op, before → after)

bench	m3f before	m3f after	m4 before	m4 after
groupby_count	36.3	19.5 (-46%)	49.9	22.5 (-55%)
groupby_sum	36.0	18.6	50.1	24.4
groupby_min	42.9	25.5	57.1	29.3
groupby_max	42.8	24.9	56.5	29.6
groupby_average	52.1	30.4	62.7	34.7
groupby_having_count	36.4	19.3	50.0	22.2
groupby_having_hidden_sum	39.9	24.2	57.3	30.1
groupby_multi_reducer	53.3	32.3	63.6	36.6
groupby_select_sum	58.6	37.2	61.4	42.4
groupby_first	35.3	18.5	47.3	29.9
groupby_where_count	23.7	14.7	37.0	19.8
groupby_where_sum	23.3	14.1	36.4	19.7
distinct_by_count (NEW)	33.7	15.3 (-55%)	44.3	19.7 (-56%)

distinct_by_count is the new bench that covers the path the existing distinct_count doesn't touch (it preselects to a scalar first). All other non-groupby/distinct_by lanes unchanged within bench noise.

Bundled

The branch carries two adjacent improvements that surfaced while shipping this:

dastest bench format with 1-decimal ns/op (commit 1). The native column truncated to int64 ns/op; go_export and the JSON message line already used :.1f. Switched native to the same shape, keeping column alignment + zero-allocs green/yellow coloring.
dastest lint enablement (commit 3). dastest/dastest_clargs.das was missing options gen2, which made it parse fine via direct daslang invocation (defaults to v2) but fail under the lint pipeline's compile_file path (defaults to v1) — @clarg_doc = ... is v2-only syntax. That parse error rippled to every lint of dastest/suite.das and blocked pre-push for any dastest touch. Added options gen2, then addressed the ~30 pre-existing PERF/STYLE/LINT warnings it surfaced in dastest/suite.das + 2 STYLE005 in dastest/log.das (the latter via suite's require chain). No behavior change.

Splice-shape test update

test_wave4_pruned_groupby_count_splice_shape previously documented "pruning DISABLED — bucket needs full tuple" as a known limitation; that limitation is now fixed for the count-by-group shape, so the gate asserts 1 get_ro (brand) + 0 for price/year. The parity test (test_wave4_pruned_groupby_count_parity) is unchanged and still green.

Test plan

mcp__daslang__lint clean on daslib/linq_fold.das, benchmarks/sql/distinct_by_count.das, tests/linq/test_linq_from_decs.das, and all touched dastest files
tests/linq 1382/1382 INTERP + 1382/1382 AOT
tests/decs 245/245 INTERP
tests/dasSQLITE 782/782 INTERP
test_aot full lane 2409/2409
Bench sweep over benchmarks/sql (199 benchmarks) — diff shows gains on 12 groupby + new distinct_by; all others unchanged
dastest --bench --bench_format native formatter visual check (3 sample benches)
CI 9-lane matrix
Copilot review

🤖 Generated with Claude Code

Previously the native column truncated to int64 ns/op while go_export and the JSON message line already used `:.1f`. The truncation hid sub-ns deltas that matter at the bench-tuning level (and that the JSON sample collection captures anyway). Switch the native format to the same `:.1f` shape, keeping the existing column-width alignment + zero-allocs green/yellow coloring. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

plan_group_by_core and plan_distinct's distinct_by arm both emitted `invoke($keyBlock, sourceVar)` per element. Switching to peel_lambda_rename_var / peel_lambda_replace_var inlines `_.field` as a direct field read, with the existing fallback to invoke for non-peelable blocks. The win is twofold: • Per-element invoke dispatch eliminated (~3-5 ns/elem in INTERP on both the array and decs lanes). • On the decs side, the whole-`decs_tup` lambda argument no longer pins the pruner — build_decs_inner_for_pruned now drops the components only the keyBlock referenced through the tuple shape (e.g. a 6-field tuple<id:int; name:string; price:int; brand:int; year:int; dealer_id:int> shrinks to tuple<brand:int>), removing per-iter string copies and matching get_ro slots. Bench results (INTERP, 100K rows, ns/op, before -> after): groupby_count m3f 36.3 -> 19.5 (-46%) m4 49.9 -> 22.5 (-55%) groupby_sum m3f 36.0 -> 18.6 m4 50.1 -> 24.4 groupby_min m3f 42.9 -> 25.5 m4 57.1 -> 29.3 groupby_max m3f 42.8 -> 24.9 m4 56.5 -> 29.6 groupby_average m3f 52.1 -> 30.4 m4 62.7 -> 34.7 groupby_having_count m3f 36.4 -> 19.3 m4 50.0 -> 22.2 groupby_having_hidden_sum m3f 39.9 -> 24.2 m4 57.3 -> 30.1 groupby_multi_reducer m3f 53.3 -> 32.3 m4 63.6 -> 36.6 groupby_select_sum m3f 58.6 -> 37.2 m4 61.4 -> 42.4 groupby_first m3f 35.3 -> 18.5 m4 47.3 -> 29.9 groupby_where_count m3f 23.7 -> 14.7 m4 37.0 -> 19.8 groupby_where_sum m3f 23.3 -> 14.1 m4 36.4 -> 19.7 distinct_by_count (NEW) m3f 33.7 -> 15.3 (-55%) m4 44.3 -> 19.7 (-56%) distinct_by_count is the new bench that exercises the distinct_by path the existing distinct_count doesn't touch (it preselects to a scalar). All other non-groupby/distinct_by lanes unchanged within bench noise. test_wave4_pruned_groupby_count_splice_shape updated: the previous test documented "pruning DISABLED — bucket needs full tuple" as a known limitation; that limitation is now fixed for the count-by-group shape, so the shape gate asserts 1 get_ro (brand) + 0 for price/year. 1382/1382 tests/linq + 245/245 tests/decs + 782/782 tests/dasSQLITE green interp; 2409/2409 AOT pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…aced warnings dastest/dastest_clargs.das was missing `options gen2`, which made it parse fine via direct daslang invocation (defaults to v2) but fail under the lint pipeline's compile_file path (defaults to v1) — field annotations (`@clarg_doc = ...`) are v2-only syntax. That parse error rippled to every lint of dastest/suite.das and blocked pre-push for any dastest touch. Adding `options gen2` to dastest_clargs.das unblocks lint, which then surfaces ~30 pre-existing PERF/STYLE/LINT warnings in dastest/suite.das and 2 STYLE005 in dastest/log.das (the latter pulled in via suite's require chain). Per the standing rule that touched files must lint clean, all of them are addressed: • PERF013 ×8 — `+= 1` → `++` on res.{total,errors,passed,failed,skipped} • PERF017 ×1 — `length(allowed) == 0` → `empty(allowed)` • PERF006 ×1 — `reserve(length(testMessages))` before push loop • STYLE005 ×10 — drop braces around single-stmt early exits (suite + log) • STYLE025 ×1 — single-stmt unsafe block with `// nolint:STYLE025` and why (both the deref and the pipe-callee are unsafe — expression form can't wrap a pipe) • LINT002 ×2 — unused loop vars renamed to `_errC` / `_i` • LINT003 ×4 — `var` → `let` on locals that are never reassigned No behavior change. dastest --bench --run still produces the same output; all linq + decs + dasSQLITE + AOT lanes green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Improves linq_fold’s group-by and distinct-by planning by peeling simple key lambdas so _.field becomes a direct field read, eliminating per-element invoke(...) overhead and enabling better decs tuple-slot pruning.

Changes:

Update plan_group_by_core and the distinct_by arms of plan_distinct / plan_decs_distinct to use peel_lambda_rename_var / peel_lambda_replace_var instead of per-element invoke(...).
Update the Wave4 pruning splice-shape test to assert the newly-enabled pruning behavior for count-by-group.
Add a new SQL benchmark (distinct_by_count) and tweak dastest benchmark native formatting to print 1-decimal ns/op; enable gen2 parsing for dastest_clargs.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
tests/linq/test_linq_from_decs.das	Updates splice-shape assertion to reflect newly-working pruning for group_by count-by-group.
dastest/suite.das	Refactors for lint/style, adds reserve for message cloning, and prints native `ns/op` with 1 decimal.
dastest/log.das	Minor early-return style cleanups.
dastest/dastest_clargs.das	Enables `options gen2` so gen2-only clargs annotations parse under lint/compile_file.
daslib/linq_fold.das	Uses peeled key expressions for `group_by` and `distinct_by` to avoid per-element `invoke(...)`.
benchmarks/sql/distinct_by_count.das	Adds a benchmark that exercises the `distinct_by

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Two slices closing the final m4-vs-m3f gaps from #2824's residual outliers. Slice 1 — trivial-let elision (closes sum_aggregate_m4 1.3ns gap): When `_select(_.userName)` peels to a single `decs_tup.<field>` reference, rename the chain bind directly to the iter var instead of synthesizing `decs_sel_N`. wrap_decs_chain skips emitting the `let decs_sel_N = car_price` binding entirely; the action's `acc += <iterVar>` references the iter var natively. Required extending DecsTupUsageScanner with an iter-var-→-user-name reverse map so bare iter-var refs still seed the pruner (previously: empty usedNames fell through to unpruned-default, defeating the elision). Slice 2 — reverse_take skip-into-tail (closes reverse_take_m4 5.2× gap): For `from_decs(...).reverse().take(N).to_array()` with no where/select, emit a two-pass invoke: pass 1 sums `arch.size` (no entity load), pass 2 uses for_each_archetype_find to skip whole archetypes whose cumulative size still fits below the skip threshold, then a per-iter skip-counter through the partial archetype, push into a takeN-sized buffer, and `return true` to stop iteration once buf is full. reverse_inplace runs on the small N buffer at end, not the full source. where/select fall through to the legacy buffer+reverse_inplace+resize emit unchanged. Bench (INTERP, 100K rows, ns/op): - sum_aggregate_m4 3.4 → 2.1 matches m3f (was the systemic 1.3ns gap) - reverse_take_m4 48.0 → 9.2 5.2× win, allocs 42B → 1B - select_where_sum_m4 7.5 → 7.5 matches m3f (elision benefits this too) - contains_match_m4 2.1 → 1.4 beats m3f at 2.2 - chained_where_m4 6.6 → 6.6 no regression - count_aggregate_m4 4.1 → 4.1 no regression Tests: - New splice-shape assertions: trivial-let elision (no decs_sel binding for `_select(_.val).sum()` and `_where(_)._select(_.val).sum()`) - New splice-shape for skip-into-tail (for_each_archetype_find count==1, decs_skips local presence) - New parity tests: multi-archetype reverse+take, take(N>total), empty source — covers the whole-archetype-skip + partial-archetype + early-return arms 1388/1388 linq + 245/245 decs + 782/782 dasSQLITE green INTERP. MCP + CI lint clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

borisbat and others added 3 commits May 22, 2026 20:12

Copilot AI review requested due to automatic review settings May 23, 2026 03:39

Copilot started reviewing on behalf of borisbat May 23, 2026 03:39 View session

Copilot AI reviewed May 23, 2026

View reviewed changes

borisbat merged commit b596317 into master May 23, 2026
30 checks passed

This was referenced May 23, 2026

linq_fold: elide per-iter decs_tup wrap — 36 m4 lanes -3.0 ns/op avg #2828

Closed

linq_fold: trivial-let elision + reverse_take skip-into-tail — closes the m4 ladder #2834

Merged

borisbat deleted the bbatkin/linq-fold-peel-key-blocks branch May 30, 2026 15:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

linq_fold: peel key blocks (group_by + distinct_by) — 16-28 ns/op wins#2824

linq_fold: peel key blocks (group_by + distinct_by) — 16-28 ns/op wins#2824
borisbat merged 3 commits into
masterfrom
bbatkin/linq-fold-peel-key-blocks

borisbat commented May 23, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

borisbat commented May 23, 2026

Summary

Bench (INTERP, 100K rows, ns/op, before → after)

Bundled

Splice-shape test update

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants