perry-codegen: close image_convolution 5%-target gap (377 ms → ≤281 ms) via IR-shape work for LLVM auto-vec #436

@proggeramlug

Summary

After aed21292 (static trip-count for-loop unroll + smart channel-SIMD gate), the image_convolution median is 377 ms against the v0.5.81 baseline of 268 ms. That is +40.7 % over baseline, while the 5 % target is ≤281 ms. Perry beats Rust on this hardware (377 vs 414 ms) and is roughly 2.5–3.4× faster than the JS runtimes (Bun at 949 ms, Node at 1288 ms); the only language ahead is Zig at 247 ms. Zig and perry both use the LLVM backend, so the gap comes from LLVM's auto-vectorizer firing on Zig's IR shape and not on perry's.

Why LLVM auto-vec fires for Zig but not perry

The otool -tv disassembly of the Zig binary shows the kernel using dup.4s, mul.4s, umull2.2d, add.4s, tbl.16b: outer-loop vectorization across 4 adjacent output pixels with NEON 128-bit vectors. Perry's optimized IR has only scalar mul i32 / add i32 chains.
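To make "outer-loop vectorization across 4 adjacent output pixels" concrete, here is a minimal Rust sketch of the two loop shapes. It is an illustration only: the 3-tap horizontal kernel, the function names and the `[i32; 4]` accumulator are assumptions made for this example, not perry's or Zig's actual benchmark code, and writing the loop this way does not by itself guarantee that LLVM vectorizes it.

```rust
/// Clamp an index into [0, len - 1] with a branch-free min/max chain.
fn clamp_idx(i: i32, len: i32) -> usize {
    i.max(0).min(len - 1) as usize
}

/// Scalar shape: one output pixel per outer iteration.
fn convolve_row_scalar(src: &[u8], k: [i32; 3]) -> Vec<i32> {
    let n = src.len() as i32;
    (0..n)
        .map(|x| {
            (0..3)
                .map(|t| k[t as usize] * src[clamp_idx(x + t - 1, n)] as i32)
                .sum::<i32>()
        })
        .collect()
}

/// Lane shape: four adjacent output pixels per outer iteration, each
/// accumulated in its own slot (the layout a 4 x i32 vector would hold).
fn convolve_row_lanes4(src: &[u8], k: [i32; 3]) -> Vec<i32> {
    let n = src.len() as i32;
    let mut out = vec![0i32; src.len()];
    let mut x = 0i32;
    while x + 4 <= n {
        let mut acc = [0i32; 4];
        for t in 0..3 {
            for lane in 0..4 {
                let idx = clamp_idx(x + lane + t - 1, n);
                acc[lane as usize] += k[t as usize] * src[idx] as i32;
            }
        }
        out[x as usize..x as usize + 4].copy_from_slice(&acc);
        x += 4;
    }
    // Scalar tail for the remaining (fewer than 4) pixels.
    while x < n {
        for t in 0..3 {
            out[x as usize] += k[t as usize] * src[clamp_idx(x + t - 1, n)] as i32;
        }
        x += 1;
    }
    out
}
```

Both functions compute the same result; the second only regroups the work so that each kernel tap is applied to four neighbouring pixels at once, which is the pattern the dup.4s / mul.4s / add.4s sequence reflects.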

Three structural differences in the IR shape that block LLVM's vectorizer on perry's output:

  1. clampIdx lowers as a dowhile/break chain with a 3-way phi. Zig's inline fn clamp_idx flattens to select(v<lo, lo, select(v>hi, hi, v)). Perry already has the right path in lower_expr_as_i32 (crates/perry-codegen/src/expr.rs:10089 — emits @llvm.smin.i32/smax.i32) but it only fires when an i32 result is required by the surrounding context. The HIR inliner in crates/perry-hir/src/lower.rs always emits the dowhile pattern.

  2. Double mirror stores on integer-stable mutable locals. Since the recent forward-closure pass and i32-init path landed, most integer locals have both an i32 slot and a double slot, and every write updates both. The double mirror store is extra memory traffic that LLVM cannot see through, so it blocks vectorization. Many integer locals never have a double reader; their mirror stores are dead but are kept for safety.

  3. Buffer-read alias metadata is per-load, not per-loop. Uint8ArrayGet's GEP+@llvm.assume+load i8 (expr.rs:5947+) emits per-access !alias.scope/!noalias via buffer_alias_metadata_suffix. Inside a hot loop reading the source buffer and writing the destination, the source-read scope and dest-write scope are disjoint sets — but LLVM's vectorizer struggles to prove this when the per-load metadata is independently scoped.
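The first difference can be illustrated at the source level. The sketch below is a Rust analogy, not perry's HIR or the IR it emits: `clamp_break_chain` imitates the do-while/break shape, `clamp_nested_select` mirrors the select(v<lo, lo, select(v>hi, hi, v)) form quoted above, and `clamp_min_max` corresponds to the smax/smin pair. All three function names are hypothetical.

```rust
/// Control-flow shape: a single-iteration loop exited by one of three breaks,
/// so the result is a merge of three incoming values.
#[allow(clippy::never_loop)]
fn clamp_break_chain(v: i32, lo: i32, hi: i32) -> i32 {
    loop {
        if v < lo {
            break lo;
        }
        if v > hi {
            break hi;
        }
        break v;
    }
}

/// Nested-select shape: no early exits, just two conditional picks.
fn clamp_nested_select(v: i32, lo: i32, hi: i32) -> i32 {
    if v < lo {
        lo
    } else if v > hi {
        hi
    } else {
        v
    }
}

/// Min/max shape: max with the lower bound, then min with the upper bound.
/// Equivalent to the two forms above whenever lo <= hi.
fn clamp_min_max(v: i32, lo: i32, hi: i32) -> i32 {
    v.max(lo).min(hi)
}
```

For lo <= hi the three forms return identical values, so rewriting one into another preserves behaviour; only the shape handed to the optimizer changes.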

Proposed plan (Option A: codegen IR-shape only, no intrinsics)

Three focused changes:

  1. Lower clamp-pattern functions directly to select(smin/smax) at every call site, not just int-required contexts. Either teach lower_call to recognize calls into clamp3_functions / clamp_u8_functions and emit the chain inline, or add a HIR pass that rewrites the dowhile-from-clamp pattern back to a Conditional ternary.

  2. Add a collect_double_readers(stmts) -> HashSet<u32> analysis that reports which integer-stable locals genuinely need their double slot maintained. Locals not in that set can drop the mirror at Stmt::Let:527, LocalSet:857 (in expr.rs), and Update. Needs fixed-point iteration (transitive double-only chains).

  3. Pair source/dest buffers in the same noalias list when they're used in the same loop body — buffer_alias_base and buffer_data_slots should allocate scope ids in disjoint sets, not independent scopes.
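The fixed-point iteration mentioned in step 2 could look roughly like the following. This is a minimal sketch over a toy statement type: `ToyStmt` and its variants are hypothetical stand-ins for perry's real HIR, and the propagation rule (a copy into a local that needs its double slot also reads the source's double slot) is an assumption made for illustration. Only the name `collect_double_readers` and the `HashSet<u32>` result come from the plan above.

```rust
use std::collections::HashSet;

/// Toy statement forms (hypothetical; not perry's real HIR).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum ToyStmt {
    /// `dst = src`, where both are integer-stable locals.
    Copy { dst: u32, src: u32 },
    /// The local is consumed where an f64 value is required.
    ReadAsDouble(u32),
    /// The local is consumed only as an i32.
    ReadAsI32(u32),
}

/// Returns the ids of locals whose double slot must be kept up to date.
fn collect_double_readers(stmts: &[ToyStmt]) -> HashSet<u32> {
    // Seed: locals with a direct double reader.
    let mut needed: HashSet<u32> = stmts
        .iter()
        .filter_map(|s| match s {
            ToyStmt::ReadAsDouble(id) => Some(*id),
            _ => None,
        })
        .collect();

    // Propagate backwards through copy chains until nothing changes.
    loop {
        let mut changed = false;
        for s in stmts {
            if let ToyStmt::Copy { dst, src } = s {
                if needed.contains(dst) && needed.insert(*src) {
                    changed = true;
                }
            }
        }
        if !changed {
            return needed;
        }
    }
}
```

With the chain `1 = 0; 2 = 1; read_as_double(2)`, locals 0, 1 and 2 all end up in the set, while a local that is only ever read as i32 stays out and would be a candidate for dropping its mirror store.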

Acceptance criteria

  • image_convolution 5-run median ≤ 281 ms (preferably ≤ 268 ms to clear σ).
  • Output checksum 2ba2e053 (matches Bun byte-for-byte).
  • ECS workloads still pass (query-perf 500, sync-hotpath 4000, perf-comprehensive 10k+migration).
  • All suite benchmarks unchanged.
  • json_pipeline_full stays at -27 % vs baseline (no regression from the IR-shape changes).
  • 15-run honest_bench (HONEST_BENCH_WARMUP=5 HONEST_BENCH_MEASURED=15) stable σ.
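The timing criterion reduces to a median and a tolerance check. The helper below is a hypothetical sketch of that arithmetic (it is not part of honest_bench, and the function names are made up for this example): with a 268 ms baseline and a 5 % tolerance the threshold is 268 × 1.05 = 281.4 ms.

```rust
/// Median of a set of run times in milliseconds; None for an empty slice.
fn median_ms(runs: &[f64]) -> Option<f64> {
    if runs.is_empty() {
        return None;
    }
    let mut sorted = runs.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    })
}

/// True when `median` is within `tolerance` (0.05 = 5 %) of `baseline`.
fn meets_target(median: f64, baseline: f64, tolerance: f64) -> bool {
    median <= baseline * (1.0 + tolerance)
}
```

Under these assumptions, the current 377 ms median fails `meets_target(377.0, 268.0, 0.05)`, while any median at or below 281 ms passes.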

Per-language ranking on M1 Max (5-run median)

  • image_convolution (ms): zig 247, perry 377, rust 414, bun 949, node 1288

Related

Estimated 3–5 days of focused codegen work.


Labels: enhancement (New feature or request), performance (Runtime or compile-time performance)
