Skip to content

linq_fold: trivial-let elision + reverse_take skip-into-tail — closes the m4 ladder#2834

Merged
borisbat merged 3 commits into
masterfrom
bbatkin/linq-fold-trivial-let-and-reverse-take
May 23, 2026
Merged

linq_fold: trivial-let elision + reverse_take skip-into-tail — closes the m4 ladder#2834
borisbat merged 3 commits into
masterfrom
bbatkin/linq-fold-trivial-let-and-reverse-take

Conversation

@borisbat
Copy link
Copy Markdown
Collaborator

Two slices that close the last m4-vs-m3f outliers from #2824's bench snapshot. Wraps the m4 perf push before docs / tutorials.

Slice 1 — trivial-let elision

Closes: sum_aggregate_m4 1.3ns systemic gap vs m3f.

When _select(_.userName) peels to a single decs_tup.<field> reference, rename the chain bind directly to the corresponding iter var instead of synthesizing decs_sel_N. wrap_decs_chain skips emitting let decs_sel_N = car_price entirely; the action's acc += <iterVar> references the iter var natively.

Required extending DecsTupUsageScanner with an iter-var → user-name reverse map so bare iter-var refs still seed the pruner (previously: empty usedNames fell through to unpruned-default, defeating the elision).

Slice 2 — reverse_take skip-into-tail

Closes: reverse_take_m4 48 → 9.2 ns/op (5.2× win, allocs 42B → 1B).

For from_decs_template(...).reverse().take(N).to_array() with no where/select, emit a two-pass invoke:

  1. Pass 1 — sum arch.size across archetypes (no entity load, just archetype-header walk)
  2. Pass 2for_each_archetype_find skips whole archetypes whose cumulative size still fits below skip = total - N; the partial archetype uses a per-iter skip-counter; once the takeN-sized buffer fills, return true stops iteration across remaining archetypes
  3. reverse_inplace runs on the small N buffer at end, not the full source

where/select fall through to the legacy buffer + reverse_inplace + resize emit unchanged — a where filter invalidates the size-based skip (count after filter is unknown without iterating).

Bench results (INTERP, 100K rows, ns/op)

Lane Before After Delta m3f
sum_aggregate_m4 3.4 2.1 -1.3 2.1
reverse_take_m4 48.0 9.2 -38.8 0.0
select_where_sum_m4 7.5 7.5 0.0 7.5
contains_match_m4 2.1 1.4 -0.7 2.2
chained_where_m4 6.6 6.6 0.0 6.6
count_aggregate_m4 4.1 4.1 0.0 4.1

sum_aggregate_m4 now matches m3f exactly. reverse_take_m4 doesn't hit m3f's 0.0 (m3f leverages array-side R6 backward-index access, which decs can't replicate without indexed-component-array helpers) but the skip-into-tail cuts the remaining cost to just the per-element iter ticks. contains_match_m4 beats m3f at 2.2.

Tests

  • test_unroll_select_sum_trivial_let_elision_splice_shape — asserts decs_sel does not appear in _select(_.val).sum() splice body
  • test_unroll_where_select_sum_trivial_let_elision_splice_shape — same elision after a _where filter
  • test_unroll5d_reverse_take_skip_into_tail_splice_shape — asserts for_each_archetype_find count + presence of decs_skips skip-counter local
  • test_reverse_take_multi_archetype_parity — exercises whole-archetype-skip + partial-archetype skip-counter
  • test_reverse_take_skip_zero_when_take_exceeds_totalactualTake = totalCount branch (early-exit doesn't fire)
  • test_reverse_take_empty_sourcetotalCount = 0 early-return before for_each_archetype_find

Verification

  • 1388/1388 linq + 245/245 decs + 782/782 dasSQLITE green INTERP
  • MCP lint + CI-style lint both clean
  • AOT lane (and JIT) deferred to CI

Out of scope

  • Closing the residual reverse_take 9.2 ns/op gap vs m3f's 0.0 would require either an indexed-component-array helper (so the inner for can start at an arbitrary offset, skipping iter ticks too) or moving the skip math into decs itself. Separate slice if it ever matters in production.
  • select is included in the skip-into-tail bail. It only affects element shape, not count, so it could in principle stay on the fast path — but the v1 emit doesn't carry projection through the inner-for body. Easy follow-up.

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 23, 2026 12:02
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tightens the decs (m4) linq-fold unroll path to close remaining performance outliers versus the array (m3f) lane, focusing on eliminating redundant binds in trivial projections and adding a specialized fast path for reverse().take(N).to_array().

Changes:

  • Add “trivial-let elision” so single-field _select(_.field) projections can reuse the underlying iter var directly and avoid emitting no-op decs_sel_* bindings; extend decs tuple usage scanning to recognize bare iter-var usage so pruning still works.
  • Add a two-pass reverse().take(N).to_array() decs fast path that first sums archetype sizes, then skips into the tail using for_each_archetype_find + a per-iter skip counter, and reverses only the small N-sized buffer.
  • Add new splice-shape and parity tests covering the new behaviors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
daslib/linq_fold.das Implements trivial-let elision (with updated decs usage scanning) and the reverse+take skip-into-tail fast path in the decs reverse planner.
tests/linq/test_linq_from_decs.das Adds splice-shape assertions and parity tests intended to validate the new decs unroll optimizations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread daslib/linq_fold.das
Comment thread daslib/linq_fold.das
Comment thread tests/linq/test_linq_from_decs.das
borisbat and others added 2 commits May 23, 2026 09:59
Two slices closing the final m4-vs-m3f gaps from #2824's residual outliers.

Slice 1 — trivial-let elision (closes sum_aggregate_m4 1.3ns gap):
When `_select(_.userName)` peels to a single `decs_tup.<field>` reference,
rename the chain bind directly to the iter var instead of synthesizing
`decs_sel_N`. wrap_decs_chain skips emitting the `let decs_sel_N = car_price`
binding entirely; the action's `acc += <iterVar>` references the iter var
natively. Required extending DecsTupUsageScanner with an iter-var-→-user-name
reverse map so bare iter-var refs still seed the pruner (previously: empty
usedNames fell through to unpruned-default, defeating the elision).

Slice 2 — reverse_take skip-into-tail (closes reverse_take_m4 5.2× gap):
For `from_decs(...).reverse().take(N).to_array()` with no where/select,
emit a two-pass invoke: pass 1 sums `arch.size` (no entity load), pass 2
uses for_each_archetype_find to skip whole archetypes whose cumulative
size still fits below the skip threshold, then a per-iter skip-counter
through the partial archetype, push into a takeN-sized buffer, and `return
true` to stop iteration once buf is full. reverse_inplace runs on the
small N buffer at end, not the full source. where/select fall through to
the legacy buffer+reverse_inplace+resize emit unchanged.

Bench (INTERP, 100K rows, ns/op):
- sum_aggregate_m4    3.4 → 2.1   matches m3f (was the systemic 1.3ns gap)
- reverse_take_m4    48.0 → 9.2   5.2× win, allocs 42B → 1B
- select_where_sum_m4 7.5 → 7.5   matches m3f (elision benefits this too)
- contains_match_m4   2.1 → 1.4   beats m3f at 2.2
- chained_where_m4    6.6 → 6.6   no regression
- count_aggregate_m4  4.1 → 4.1   no regression

Tests:
- New splice-shape assertions: trivial-let elision (no decs_sel binding
  for `_select(_.val).sum()` and `_where(_)._select(_.val).sum()`)
- New splice-shape for skip-into-tail (for_each_archetype_find count==1,
  decs_skips local presence)
- New parity tests: multi-archetype reverse+take, take(N>total),
  empty source — covers the whole-archetype-skip + partial-archetype +
  early-return arms

1388/1388 linq + 245/245 decs + 782/782 dasSQLITE green INTERP. MCP + CI
lint clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
#2835 fixed the typer-pass-order #2830 that originally tripped this test
on the extended_checks (linux, 64) lane. With master now containing the
fix, the test compiles cleanly on all lanes. Re-adding it covers a case
the current suite missed:

  from_decs_template(type<Row>)._where(_.a >= 0)._where(_.b >= 0).count()

Three chained single-field _where_s — all 3 fields read via field access,
no whole-var ref. The splice must keep all 3 get_ros (no slot pruning)
but elide the named-tuple bind (no decs_tup in the body, iter vars read
directly). Lesson saved to memory: not every CI lane runs every test, so
"platform-specific" failures often mean "we only check this on one
platform" — not that the bug itself is platform-specific.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@borisbat borisbat force-pushed the bbatkin/linq-fold-trivial-let-and-reverse-take branch from 836aee0 to eec6002 Compare May 23, 2026 17:02
The original test created one A-only and one B-only archetype, but
the query `from_decs_template(type<RevTakeMultiArchA>)` only matches A
— the B archetype never enters for_each_archetype, so the cross-
archetype skipping arm wasn't actually exercised.

Now creates two MATCHING archetypes: both have `rev2_id` (so both
satisfy the query), but the second group also has the rev2_b_* extras
which lands it in a separate archetype class. With A1=4 + A2=5 →
totalCount=9, take(3) → skip=6: A1 (size 4) skipped via the size-sum
arm, A2 enters with skipsLeft=2 → drains 2, pushes 3, returns true.
Exercises both the whole-archetype-skip and partial-archetype +
early-exit paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@borisbat borisbat merged commit 81823e4 into master May 23, 2026
29 checks passed
pull Bot pushed a commit to forksnd/daScript that referenced this pull request May 26, 2026
Step 3 — emit_loop_or_count_lane_decs:
  Reconstructs flatten_linq-shaped calls array from captures (head + range
  ops + term) and dispatches to existing emit_decs_* lane fns. Per-adapter
  state-hoist shape stays (array binds source as invoke arg, decs zero-arg
  with state above for_each_archetype) — D1 keeps the 4 array lane fns
  untouched.

Step 4+5 — 4 decs planners now thin pattern-table stubs:
  plan_decs_unroll       → plan_loop_or_count_patterns + Decs adapter
  plan_decs_order_family → plan_order_family_patterns + Decs adapter
                           (Row 4 buffer_helper_dispatch gated by array_source;
                            decs cascades to Row 3 fused_prefilter)
  plan_decs_reverse      → plan_reverse_patterns + Decs adapter
                           (backward-walk rows already array_source-gated)
  plan_decs_distinct     → plan_distinct_patterns + Decs adapter
                           (emit_hashtable_dedup carries per-adapter take(N)
                            branch — Decs for_each_archetype_find, Array break)
  All 4 imperative bodies hard-deleted: -1173 LOC, +492 LOC = -681 LOC net.

Skip-into-tail preserved (emit_decs_reverse_skip_into_tail):
  Decs `reverse |> take(N) |> to_array` fast path lifted into a dedicated
  emit fn that emit_reverse_buffer_inplace pre-checks before the general
  buffer path. Preserves PR GaijinEntertainment#2834's 5.2× perf gain on multi-archetype
  decs sources.

Test fixes (tests/linq/test_linq_from_decs.das):
  6 splice-shape assertions updated to match unified naming:
    decs_buf → order_buf (order family) / `buf  (distinct)
    decs_seen → order_seen (order family) / `seen (distinct)
    decs_best → order_best, decs_taken → `taken, decs_acc → `acc

Verification:
  - mcp__daslang__lint: clean
  - test_linq_from_decs (198), test_linq_fold (385), test_linq_fold_ast (228)
  - all themes 2/3/6/7/8 + pattern_walker + collapse_chained_wheres
  - test_queries, test_queries_comprehensive (decs core)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@borisbat borisbat deleted the bbatkin/linq-fold-trivial-let-and-reverse-take branch May 30, 2026 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants