perry-codegen: close image_convolution 5%-target gap (377 ms → ≤281 ms) via IR-shape work for LLVM auto-vec #436

## Summary

After `aed21292` (static trip-count for-loop unroll + smart channel-SIMD gate), the `image_convolution` median is 377 ms against the v0.5.81 baseline of 268 ms: **+40.7 % over baseline; the 5 % target is ≤281 ms**. Perry beats Rust on this hardware (377 vs 414 ms) and is 2.5–3.4× faster than the JavaScript runtimes measured (Bun at 949 ms, Node at 1288 ms); the only language ahead is Zig at 247 ms. Zig and perry both use the LLVM backend; the gap comes from LLVM's auto-vectorizer firing on Zig's IR shape and not on perry's.

## Why LLVM auto-vec fires for Zig but not perry

Disassembling the Zig binary with `otool -tv` shows the kernel using `dup.4s`, `mul.4s`, `umull2.2d`, `add.4s`, and `tbl.16b`: outer-loop vectorization across 4 adjacent output pixels with 128-bit NEON vectors. Perry's optimized IR has only scalar `mul i32` / `add i32` chains.

Three structural differences in the IR shape block LLVM's vectorizer on perry's output:

1. **`clampIdx` lowers as a `dowhile/break` chain with a 3-way phi.** Zig's `inline fn clamp_idx` flattens to `select(v<lo, lo, select(v>hi, hi, v))`. Perry already has the right path in `lower_expr_as_i32` (`crates/perry-codegen/src/expr.rs:10089` — emits `@llvm.smin.i32`/`smax.i32`) but it only fires when an i32 result is required by the surrounding context. The HIR inliner in `crates/perry-hir/src/lower.rs` always emits the dowhile pattern.

2. **Double mirror stores on integer-stable mutable locals.** After the recent forward-closure pass and i32-init path landed, most integer locals have both an i32 slot and a double slot, and both are updated on every write. The double mirror is opaque memory traffic that LLVM treats as a barrier to vectorization. Many integer locals never have a double reader; their mirror stores are dead but kept for safety.

3. **Buffer-read alias metadata is per-load, not per-loop.** `Uint8ArrayGet`'s GEP+`@llvm.assume`+`load i8` (`expr.rs:5947+`) emits per-access `!alias.scope`/`!noalias` via `buffer_alias_metadata_suffix`. Inside a hot loop reading the source buffer and writing the destination, the source-read scope and dest-write scope are disjoint sets — but LLVM's vectorizer struggles to prove this when the per-load metadata is independently scoped.
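The first difference (the clamp lowering) can be illustrated with a minimal Rust sketch. `clamp_branchy` and `clamp_select` are hypothetical names, not perry or Zig functions: the first has the early-exit shape that is comparable to the dowhile/break chain with a 3-way phi, the second is the flattened max-then-min form that corresponds to `@llvm.smax.i32` / `@llvm.smin.i32`. The sketch assumes `lo <= hi`.

```rust
/// Early-exit form: three exits that merge into one result, comparable in
/// shape to a dowhile/break chain joined by a 3-way phi.
fn clamp_branchy(v: i32, lo: i32, hi: i32) -> i32 {
    if v < lo {
        return lo;
    }
    if v > hi {
        return hi;
    }
    v
}

/// Flattened form: select(v < lo, lo, select(v > hi, hi, v)), written as
/// max-then-min. Straight-line code with no control flow.
/// Assumes lo <= hi.
fn clamp_select(v: i32, lo: i32, hi: i32) -> i32 {
    v.max(lo).min(hi)
}
```

Both functions return the same value for every input when `lo <= hi`, so rewriting one shape into the other does not change program output; only the control-flow shape seen by the vectorizer differs.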

## Proposed plan (Option A: codegen IR-shape only, no intrinsics)

Three focused changes:

1. **Lower clamp-pattern functions directly to `select(smin/smax)` at every call site, not just int-required contexts.** Either teach `lower_call` to recognize calls into `clamp3_functions` / `clamp_u8_functions` and emit the chain inline, or add a HIR pass that rewrites the dowhile-from-clamp pattern back to a `Conditional` ternary.

2. **Add a `collect_double_readers(stmts) -> HashSet<u32>` analysis** that reports which integer-stable locals genuinely need their double slot maintained. Locals not in that set can drop the mirror at `Stmt::Let:527`, `LocalSet:857` (in expr.rs), and `Update`. Needs fixed-point iteration (transitive double-only chains).

3. **Pair source/dest buffers in the same `noalias` list** when they're used in the same loop body. `buffer_alias_base` and `buffer_data_slots` should allocate scope ids so that the source-read and dest-write scopes are declared mutually `noalias` for the whole loop, rather than scoping each load independently.
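The fixed-point iteration mentioned in change 2 could look roughly like the sketch below. Everything except the function name and return type is an assumption: the `Stmt` enum is a hypothetical two-variant stand-in for perry's real HIR, and the propagation rule (a copy into a local that keeps its double slot reads the source's double slot) is one possible definition of a "transitive double-only chain", not necessarily the one the codegen needs.

```rust
use std::collections::HashSet;

/// Hypothetical, heavily simplified statement forms. Any use not listed
/// here is assumed to read only the i32 slot.
enum Stmt {
    /// `dst = src`, where both are integer-stable locals.
    Copy { dst: u32, src: u32 },
    /// A read of `local` that needs its f64 value.
    DoubleUse { local: u32 },
}

/// Returns the ids of locals whose double slot must be maintained.
fn collect_double_readers(stmts: &[Stmt]) -> HashSet<u32> {
    let mut readers = HashSet::new();

    // Seed with locals that are read directly as doubles.
    for stmt in stmts {
        if let Stmt::DoubleUse { local } = stmt {
            readers.insert(*local);
        }
    }

    // Propagate backwards through copies until nothing changes:
    // if `dst` keeps its double slot, `src` must keep its own too
    // (assumed rule, see the lead-in).
    let mut changed = true;
    while changed {
        changed = false;
        for stmt in stmts {
            if let Stmt::Copy { dst, src } = stmt {
                if readers.contains(dst) && readers.insert(*src) {
                    changed = true;
                }
            }
        }
    }

    readers
}
```

For the chain `2 = 1; 3 = 2; double-use(3); 5 = 4`, this returns `{1, 2, 3}`: locals 4 and 5 never reach a double reader, so their mirror stores would be candidates for removal. The loop terminates because the set only grows and is bounded by the number of locals.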

## Acceptance criteria

- `image_convolution` 5-run median ≤ 281 ms (preferably ≤ 268 ms to clear σ).
- Output checksum `2ba2e053` (matches Bun byte-for-byte).
- ECS workloads still pass (query-perf 500, sync-hotpath 4000, perf-comprehensive 10k+migration).
- All other suite benchmarks unchanged (within run-to-run noise).
- `json_pipeline_full` stays at -27 % vs baseline (no regression from the IR-shape changes).
- 15-run honest_bench (`HONEST_BENCH_WARMUP=5 HONEST_BENCH_MEASURED=15`) stable σ.
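The headline criterion can be checked mechanically. The sketch below is not part of honest_bench; `median_ms` and `within_target` are hypothetical helpers, and any run times passed to them in examples are illustrative rather than measured. With the 268 ms baseline and a 5 % tolerance the threshold is 268 × 1.05 = 281.4 ms.

```rust
/// Median of a non-empty slice of run times in milliseconds.
fn median_ms(runs: &[f64]) -> f64 {
    let mut sorted = runs.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        // Even count: average the two middle values.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    }
}

/// True when the median is within `tolerance` of the baseline
/// (tolerance 0.05 means "at most 5 % slower").
fn within_target(runs: &[f64], baseline_ms: f64, tolerance: f64) -> bool {
    median_ms(runs) <= baseline_ms * (1.0 + tolerance)
}
```

For example, `within_target(&[280.0, 275.0, 290.0, 279.0, 281.0], 268.0, 0.05)` is true (median 280 ms), while five runs at 377 ms fail the check.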

## Per-language ranking on M1 Max (5-run median)

- `image_convolution` (ms): zig 247, **perry 377**, rust 414, bun 949, node 1288

## Related

- Earlier session work on this gap: commits `cf540b92`, `f02541a4`, `33d8dc41`, `3f5c69a5`, `4f895dd8`, `817c4b56`, `aed21292`.
- The verify-output discipline from #PERRY-LESSONS: every iteration must diff against Bun.
- Issue #435 (integer_locals overflow) overlaps this slightly but is a correctness issue, not a perf one. Don't merge them.

Estimated 3–5 days of focused codegen work.

