linq_fold: peel key blocks (group_by + distinct_by) — 16-28 ns/op wins#2824
Merged
Conversation
Previously the native column truncated to int64 ns/op while go_export and the JSON message line already used `:.1f`. The truncation hid sub-ns deltas that matter at the bench-tuning level (and that the JSON sample collection captures anyway). Switch the native format to the same `:.1f` shape, keeping the existing column-width alignment + zero-allocs green/yellow coloring. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
plan_group_by_core and plan_distinct's distinct_by arm both emitted
`invoke($keyBlock, sourceVar)` per element. Switching to peel_lambda_rename_var
/ peel_lambda_replace_var inlines `_.field` as a direct field read, with the
existing fallback to invoke for non-peelable blocks. The win is twofold:
• Per-element invoke dispatch eliminated (~3-5 ns/elem in INTERP on both the
array and decs lanes).
• On the decs side, the whole-`decs_tup` lambda argument no longer pins the
pruner — build_decs_inner_for_pruned now drops the components only the
keyBlock referenced through the tuple shape (e.g. a 6-field tuple<id:int;
name:string; price:int; brand:int; year:int; dealer_id:int> shrinks to
tuple<brand:int>), removing per-iter string copies and matching get_ro
slots.
Bench results (INTERP, 100K rows, ns/op, before -> after):
groupby_count m3f 36.3 -> 19.5 (-46%) m4 49.9 -> 22.5 (-55%)
groupby_sum m3f 36.0 -> 18.6 m4 50.1 -> 24.4
groupby_min m3f 42.9 -> 25.5 m4 57.1 -> 29.3
groupby_max m3f 42.8 -> 24.9 m4 56.5 -> 29.6
groupby_average m3f 52.1 -> 30.4 m4 62.7 -> 34.7
groupby_having_count m3f 36.4 -> 19.3 m4 50.0 -> 22.2
groupby_having_hidden_sum m3f 39.9 -> 24.2 m4 57.3 -> 30.1
groupby_multi_reducer m3f 53.3 -> 32.3 m4 63.6 -> 36.6
groupby_select_sum m3f 58.6 -> 37.2 m4 61.4 -> 42.4
groupby_first m3f 35.3 -> 18.5 m4 47.3 -> 29.9
groupby_where_count m3f 23.7 -> 14.7 m4 37.0 -> 19.8
groupby_where_sum m3f 23.3 -> 14.1 m4 36.4 -> 19.7
distinct_by_count (NEW) m3f 33.7 -> 15.3 (-55%) m4 44.3 -> 19.7 (-56%)
distinct_by_count is the new bench that exercises the distinct_by path the
existing distinct_count doesn't touch (it preselects to a scalar). All other
non-groupby/distinct_by lanes unchanged within bench noise.
test_wave4_pruned_groupby_count_splice_shape updated: the previous test
documented "pruning DISABLED — bucket needs full tuple" as a known limitation;
that limitation is now fixed for the count-by-group shape, so the shape gate
asserts 1 get_ro (brand) + 0 for price/year.
1382/1382 tests/linq + 245/245 tests/decs + 782/782 tests/dasSQLITE green
interp; 2409/2409 AOT pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…aced warnings
dastest/dastest_clargs.das was missing `options gen2`, which made it parse
fine via direct daslang invocation (defaults to v2) but fail under the lint
pipeline's compile_file path (defaults to v1) — field annotations
(`@clarg_doc = ...`) are v2-only syntax. That parse error rippled to every
lint of dastest/suite.das and blocked pre-push for any dastest touch.
Adding `options gen2` to dastest_clargs.das unblocks lint, which then surfaces
~30 pre-existing PERF/STYLE/LINT warnings in dastest/suite.das and 2 STYLE005
in dastest/log.das (the latter pulled in via suite's require chain). Per the
standing rule that touched files must lint clean, all of them are addressed:
• PERF013 ×8 — `+= 1` → `++` on res.{total,errors,passed,failed,skipped}
• PERF017 ×1 — `length(allowed) == 0` → `empty(allowed)`
• PERF006 ×1 — `reserve(length(testMessages))` before push loop
• STYLE005 ×10 — drop braces around single-stmt early exits (suite + log)
• STYLE025 ×1 — single-stmt unsafe block with `// nolint:STYLE025` and
why (both the deref and the pipe-callee are unsafe — expression form
can't wrap a pipe)
• LINT002 ×2 — unused loop vars renamed to `_errC` / `_i`
• LINT003 ×4 — `var` → `let` on locals that are never reassigned
No behavior change. dastest --bench --run still produces the same output;
all linq + decs + dasSQLITE + AOT lanes green.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Improves linq_fold’s group-by and distinct-by planning by peeling simple key lambdas so _.field becomes a direct field read, eliminating per-element invoke(...) overhead and enabling better decs tuple-slot pruning.
Changes:
- Update
plan_group_by_coreand thedistinct_byarms ofplan_distinct/plan_decs_distinctto usepeel_lambda_rename_var/peel_lambda_replace_varinstead of per-elementinvoke(...). - Update the Wave4 pruning splice-shape test to assert the newly-enabled pruning behavior for count-by-group.
- Add a new SQL benchmark (
distinct_by_count) and tweakdastestbenchmark native formatting to print 1-decimalns/op; enable gen2 parsing fordastest_clargs.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| tests/linq/test_linq_from_decs.das | Updates splice-shape assertion to reflect newly-working pruning for group_by count-by-group. |
| dastest/suite.das | Refactors for lint/style, adds reserve for message cloning, and prints native ns/op with 1 decimal. |
| dastest/log.das | Minor early-return style cleanups. |
| dastest/dastest_clargs.das | Enables options gen2 so gen2-only clargs annotations parse under lint/compile_file. |
| daslib/linq_fold.das | Uses peeled key expressions for group_by and distinct_by to avoid per-element invoke(...). |
| benchmarks/sql/distinct_by_count.das | Adds a benchmark that exercises the `distinct_by |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This was referenced May 23, 2026
borisbat
added a commit
that referenced
this pull request
May 23, 2026
Two slices closing the final m4-vs-m3f gaps from #2824's residual outliers. Slice 1 — trivial-let elision (closes sum_aggregate_m4 1.3ns gap): When `_select(_.userName)` peels to a single `decs_tup.<field>` reference, rename the chain bind directly to the iter var instead of synthesizing `decs_sel_N`. wrap_decs_chain skips emitting the `let decs_sel_N = car_price` binding entirely; the action's `acc += <iterVar>` references the iter var natively. Required extending DecsTupUsageScanner with an iter-var-→-user-name reverse map so bare iter-var refs still seed the pruner (previously: empty usedNames fell through to unpruned-default, defeating the elision). Slice 2 — reverse_take skip-into-tail (closes reverse_take_m4 5.2× gap): For `from_decs(...).reverse().take(N).to_array()` with no where/select, emit a two-pass invoke: pass 1 sums `arch.size` (no entity load), pass 2 uses for_each_archetype_find to skip whole archetypes whose cumulative size still fits below the skip threshold, then a per-iter skip-counter through the partial archetype, push into a takeN-sized buffer, and `return true` to stop iteration once buf is full. reverse_inplace runs on the small N buffer at end, not the full source. where/select fall through to the legacy buffer+reverse_inplace+resize emit unchanged. Bench (INTERP, 100K rows, ns/op): - sum_aggregate_m4 3.4 → 2.1 matches m3f (was the systemic 1.3ns gap) - reverse_take_m4 48.0 → 9.2 5.2× win, allocs 42B → 1B - select_where_sum_m4 7.5 → 7.5 matches m3f (elision benefits this too) - contains_match_m4 2.1 → 1.4 beats m3f at 2.2 - chained_where_m4 6.6 → 6.6 no regression - count_aggregate_m4 4.1 → 4.1 no regression Tests: - New splice-shape assertions: trivial-let elision (no decs_sel binding for `_select(_.val).sum()` and `_where(_)._select(_.val).sum()`) - New splice-shape for skip-into-tail (for_each_archetype_find count==1, decs_skips local presence) - New parity tests: multi-archetype reverse+take, take(N>total), empty source — covers the whole-archetype-skip + partial-archetype + early-return arms 1388/1388 linq + 245/245 decs + 782/782 dasSQLITE green INTERP. MCP + CI lint clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Bench-driven follow-up after PR #2822 closed the Wave 5 terminator splice ladder. A sweep over
benchmarks/sqlflagged group_by lanes as consistently 1.2–1.6× slower on the decs splice (m4) than the array splice (m3f). Root cause:plan_group_by_coreandplan_distinct'sdistinct_byarm emittedinvoke($keyBlock, sourceVar)per element. Switching topeel_lambda_rename_var/peel_lambda_replace_varinlines_.fieldas a direct field read, with the existing fallback toinvoke(...)for non-peelable blocks.The win is twofold:
decs_tuplambda argument no longer pins the pruner —build_decs_inner_for_prunednow drops the components only the keyBlock referenced through the tuple shape (e.g.tuple<id:int;name:string;price:int;brand:int;year:int;dealer_id:int>shrinks totuple<brand:int>), removing per-iter string copies and matchingget_roslots.Bench (INTERP, 100K rows, ns/op, before → after)
distinct_by_countis the new bench that covers the path the existingdistinct_countdoesn't touch (it preselects to a scalar first). All other non-groupby/distinct_by lanes unchanged within bench noise.Bundled
The branch carries two adjacent improvements that surfaced while shipping this:
dastest bench format with 1-decimal ns/op (commit 1). The
nativecolumn truncated toint64ns/op;go_exportand the JSON message line already used:.1f. Switched native to the same shape, keeping column alignment + zero-allocs green/yellow coloring.dastest lint enablement (commit 3).
dastest/dastest_clargs.daswas missingoptions gen2, which made it parse fine via direct daslang invocation (defaults to v2) but fail under the lint pipeline'scompile_filepath (defaults to v1) —@clarg_doc = ...is v2-only syntax. That parse error rippled to every lint ofdastest/suite.dasand blocked pre-push for any dastest touch. Addedoptions gen2, then addressed the ~30 pre-existing PERF/STYLE/LINT warnings it surfaced indastest/suite.das+ 2 STYLE005 indastest/log.das(the latter via suite's require chain). No behavior change.Splice-shape test update
test_wave4_pruned_groupby_count_splice_shapepreviously documented "pruning DISABLED — bucket needs full tuple" as a known limitation; that limitation is now fixed for the count-by-group shape, so the gate asserts 1get_ro(brand) + 0 forprice/year. The parity test (test_wave4_pruned_groupby_count_parity) is unchanged and still green.Test plan
mcp__daslang__lintclean ondaslib/linq_fold.das,benchmarks/sql/distinct_by_count.das,tests/linq/test_linq_from_decs.das, and all touched dastest filestests/linq1382/1382 INTERP + 1382/1382 AOTtests/decs245/245 INTERPtests/dasSQLITE782/782 INTERPtest_aotfull lane 2409/2409benchmarks/sql(199 benchmarks) — diff shows gains on 12 groupby + new distinct_by; all others unchangeddastest --bench --bench_format nativeformatter visual check (3 sample benches)🤖 Generated with Claude Code