Summary
After aed21292 (static trip-count for-loop unroll + smart channel-SIMD gate), the image_convolution median is 377 ms against the v0.5.81 baseline of 268 ms: +40.7 % over baseline, where the 5 % target is ≤281 ms. Perry beats Rust on this hardware (377 vs 414 ms) and is 2.5–3.4× faster than the JavaScript runtimes measured (Bun 949 ms, Node 1288 ms); the only language ahead is Zig at 247 ms. Zig and perry both target the same LLVM backend, so the gap comes from LLVM's auto-vectorizer firing on Zig's IR shape and not on perry's.
Why LLVM auto-vec fires for Zig but not perry
Inspecting otool -tv on the Zig binary shows the kernel uses dup.4s, mul.4s, umull2.2d, add.4s, tbl.16b — outer-loop vectorization across 4 adjacent output pixels with NEON 128-bit vectors. Perry's optimized IR has only scalar mul i32 / add i32 chains.
Three structural differences in the IR shape that block LLVM's vectorizer on perry's output:
- clampIdx lowers as a dowhile/break chain with a 3-way phi. Zig's inline fn clamp_idx flattens to select(v<lo, lo, select(v>hi, hi, v)). Perry already has the right path in lower_expr_as_i32 (crates/perry-codegen/src/expr.rs:10089, which emits @llvm.smin.i32/smax.i32), but it only fires when the surrounding context requires an i32 result. The HIR inliner in crates/perry-hir/src/lower.rs always emits the dowhile pattern.
- Double mirror stores on integer-stable mutable locals. After the recent forward-closure pass + i32-init path landed, most integer locals have BOTH an i32 slot AND a double slot, both updated on every write. The double mirror is opaque memory traffic that LLVM treats as preventing vectorization. Many integer locals never have a double reader; those mirror stores are dead but kept for safety.
- Buffer-read alias metadata is per-load, not per-loop. Uint8ArrayGet's GEP+@llvm.assume+load i8 (expr.rs:5947+) emits per-access !alias.scope/!noalias via buffer_alias_metadata_suffix. Inside a hot loop reading the source buffer and writing the destination, the source-read scope and dest-write scope are disjoint sets, but LLVM's vectorizer struggles to prove this when the per-load metadata is independently scoped.
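The first difference can be illustrated outside the compiler. The sketch below is plain Rust, not perry or Zig output, and both function names are hypothetical. It only shows that a clamp written with early exits and one written as a max followed by a min produce the same value whenever lo ≤ hi, which is the equivalence a flattened lowering would rely on.

```rust
/// Clamp written with early exits. Each `return` is a separate control-flow
/// edge, loosely mirroring the dowhile/break shape described above.
fn clamp_idx_branchy(v: i32, lo: i32, hi: i32) -> i32 {
    if v < lo {
        return lo;
    }
    if v > hi {
        return hi;
    }
    v
}

/// Clamp written as max-then-min: no control flow in the source,
/// the same result as the branchy form as long as lo <= hi.
fn clamp_idx_flat(v: i32, lo: i32, hi: i32) -> i32 {
    v.max(lo).min(hi)
}

/// Checks the two forms against each other over a small range.
fn clamp_forms_agree(lo: i32, hi: i32) -> bool {
    (lo - 8..=hi + 8).all(|v| clamp_idx_branchy(v, lo, hi) == clamp_idx_flat(v, lo, hi))
}
```

Calling clamp_forms_agree(0, 7) compares both forms for every v from -8 to 15. The precondition lo ≤ hi matters: with lo > hi the two forms can disagree, so a rewrite would need to know the bounds are ordered.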
Proposed plan (Option A: codegen IR-shape only, no intrinsics)
Three focused changes:
- Lower clamp-pattern functions directly to select(smin/smax) at every call site, not just int-required contexts. Either teach lower_call to recognize calls into clamp3_functions / clamp_u8_functions and emit the chain inline, or add a HIR pass that rewrites the dowhile-from-clamp pattern back to a Conditional ternary.
- Add a collect_double_readers(stmts) -> HashSet<u32> analysis that reports which integer-stable locals genuinely need their double slot maintained. Locals not in that set can drop the mirror at Stmt::Let:527, LocalSet:857 (in expr.rs), and Update. Needs fixed-point iteration (transitive double-only chains).
- Pair source/dest buffers in the same noalias list when they're used in the same loop body: buffer_alias_base and buffer_data_slots should allocate scope ids in disjoint sets, not independent scopes.
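The fixed-point requirement in the second change can be sketched on a toy model. Everything below is an assumption made for illustration: the Use enum is hypothetical and far simpler than perry's HIR, and the function takes a flat list of uses, not real statements. The modelled rule is that a copy `dst = src` reads src as a double only if dst itself ends up needing its double slot, which is why a single pass is not enough.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified record of how a local's value is consumed.
#[allow(dead_code)]
enum Use {
    /// Consumed by an integer-only operation (index math, i32 arithmetic).
    IntOnly { src: u32 },
    /// Read as a double by something outside the integer path.
    DoubleRead { src: u32 },
    /// `dst = src`: needs src's double slot only if dst needs one.
    Copy { src: u32, dst: u32 },
}

/// Returns the ids of locals whose double slot must be maintained.
fn collect_double_readers(uses: &[Use]) -> HashSet<u32> {
    let mut need: HashSet<u32> = HashSet::new();

    // Seed with locals that have a direct double reader.
    for u in uses {
        if let Use::DoubleRead { src } = u {
            need.insert(*src);
        }
    }

    // Propagate backwards through copies until nothing changes.
    loop {
        let mut changed = false;
        for u in uses {
            if let Use::Copy { src, dst } = u {
                if need.contains(dst) && need.insert(*src) {
                    changed = true;
                }
            }
        }
        if !changed {
            break;
        }
    }
    need
}
```

In a chain 1 → 2 → 3 where only local 3 has a double reader, the first pass marks 2 and the second marks 1. A chain 4 → 5 with no double reader anywhere stays out of the set, so both mirrors could be dropped under this model.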
Acceptance criteria
- image_convolution 5-run median ≤ 281 ms (preferably ≤ 268 ms to clear σ).
- Output checksum 2ba2e053 (matches Bun byte-for-byte).
- ECS workloads still pass (query-perf 500, sync-hotpath 4000, perf-comprehensive 10k+migration).
- All suite benchmarks unchanged.
- json_pipeline_full stays at -27 % vs baseline (no regression from the IR-shape changes).
- 15-run honest_bench (HONEST_BENCH_WARMUP=5 HONEST_BENCH_MEASURED=15) stable σ.
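The first criterion reduces to a small piece of arithmetic: 268 ms × 1.05 = 281.4 ms, so a whole-millisecond median passes at 281 and fails at 282. A minimal sketch of that check follows; the helper names are hypothetical and this is not how honest_bench computes its statistics.

```rust
/// Median of a sample of run times in milliseconds.
/// Assumes a non-empty slice; for an even length it returns the upper middle value.
fn median_ms(runs: &[u32]) -> u32 {
    let mut sorted = runs.to_vec();
    sorted.sort_unstable();
    sorted[sorted.len() / 2]
}

/// True when `median` is at most `pct` percent above `baseline`.
/// Integer form of median <= baseline * (1 + pct / 100), avoiding float rounding.
fn within_target(median: u32, baseline: u32, pct: u32) -> bool {
    median * 100 <= baseline * (100 + pct)
}
```

With the figures from the summary, within_target(377, 268, 5) is false, while within_target(281, 268, 5) is true.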
Per-language ranking on M1 Max (5-run median)
- image_conv (ms): zig 247, perry 377, rust 414, bun 949, node 1288
Related
cf540b92, f02541a4, 33d8dc41, 3f5c69a5, 4f895dd8, 817c4b56, aed21292.
Estimated 3–5 days of focused codegen work.