v2.8.21
Three IR codegen wins compound into a 5.2% bootstrap size reduction (1.24 MB → 1.18 MB) and a 30% drop in sort runtime (153 ms → 108 ms), which puts krc 2.5× ahead of gcc -O2 on bubble sort.
What changed
- 6th register colour (rbp). The graph-colouring regalloc gained one more callee-saved register, which lowers the spill rate compiler-wide. rbp had historically been left out; the lz4 / fat-archive paths surfaced an off-by-one in stack-arg overflow loads, fixed by replacing the hardcoded `ir_frame_size + 48` ("5 pushes + ret addr") with `ir_frame_size + ir_callee_save_bytes + 8`.
- Per-function used-callee-save prologue. Functions now push only the callee-saved registers whose colours regalloc actually assigned. fib's prologue dropped from 5 pushes to 3; leaf-ish helpers often drop to 0-1. Variable alignment math (push_count parity decides whether frame_size gets an extra 8 bytes) keeps SP 16-aligned at every CALL.
- Cross-register spill-reload peephole. `store rax,V; load rcx,V` (different reg) now emits `mov rcx, rax` instead of a stack roundtrip. Catches matmul-style intermediate-vreg flows through different scratch regs.
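
The frame arithmetic and the peephole from the list above can be sketched in a few lines of Python. This is an illustration, not krc's code: the function names and the tuple instruction format are hypothetical. It assumes the x86-64 System V convention (rsp is 8 mod 16 on function entry, each push is 8 bytes) and a prologue that pushes callee-saved registers before reserving the frame. The peephole keeps the store, since the sketch has no liveness information; whether krc also drops the store is not stated here.

```python
WORD = 8  # bytes per push and per return address on x86-64


def aligned_frame_size(local_bytes, push_count):
    """Frame size that leaves rsp 16-byte aligned after the prologue.

    Assumes rsp % 16 == 8 on entry (the CALL pushed a return address)
    and that the prologue does push_count pushes, then `sub rsp, size`.
    """
    size = (local_bytes + 15) & ~15  # round locals up to a multiple of 16
    # Under these assumptions an even push count leaves rsp 8 bytes off,
    # so push_count parity decides whether the frame grows by 8.
    if (WORD + push_count * WORD + size) % 16 != 0:
        size += 8
    return size


def first_stack_arg_offset(frame_size, push_count):
    """Offset from rsp (after the prologue) to the first stack-passed arg."""
    callee_save_bytes = push_count * WORD
    return frame_size + callee_save_bytes + WORD  # + return address


def peephole_spill_reload(insns):
    """Rewrite `store r1,V; load r2,V` so the reload becomes a reg-reg mov.

    Instructions are hypothetical tuples: ("store", reg, slot),
    ("load", reg, slot), ("mov", dst, src).
    """
    out = []
    for ins in insns:
        prev = out[-1] if out else None
        if (prev is not None and prev[0] == "store"
                and ins[0] == "load" and prev[2] == ins[2]):
            src, dst = prev[1], ins[1]
            if src != dst:
                out.append(("mov", dst, src))  # cross-register case
            # Same register: the value is already in place, drop the load.
            continue
        out.append(ins)
    return out


if __name__ == "__main__":
    print(first_stack_arg_offset(0, 5))  # 48, the old hardcoded constant
    print(first_stack_arg_offset(0, 6))  # 56, what six pushes require
    print(peephole_spill_reload([("store", "rax", "V"), ("load", "rcx", "V")]))
```

With five pushes the formula reproduces the old constant of 48, which shows why the hardcoded value broke once the push count became variable.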
Runtime delta (Ryzen 9 7900X)
| bench | v2.8.20 | v2.8.21 | gcc -O2 | krc Δ |
|---|---|---|---|---|
| fib | 442 ms | 427 ms | 78 ms | -3% |
| sort | 153 ms | 108 ms | 270 ms | -30%, 2.5× ahead of gcc -O2 |
| sieve | 3 ms | 3 ms | 2 ms | tied |
| matmul | 34 ms | 33 ms | 4 ms | -3% |
Verified: bootstrap fixed point at 1,176,168 bytes; 439/439 tests pass.
Next on the optimization roadmap
- matmul is still 8× behind gcc -O2. It needs loop strength reduction to remove per-iteration address recomputation: ~10 of the ~16 inner-loop insns are `(i*N+j)*8` calculations that the compiler should hoist into a running pointer.
- fib is still 5× behind. gcc -O2 inlines fib 4-5 levels deep before materialising the leaves as real `call`s; closing the gap needs recursive inlining at the IR level.
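
To make the matmul item concrete, here is a small Python sketch of what strength reduction does to the `(i*N+j)*8` address pattern. The function names and the `base` parameter are hypothetical, and this models the address arithmetic only, not krc's IR. Both functions yield the same addresses, but the second computes the row start once and then adds a constant stride per iteration.

```python
ELEM = 8  # bytes per matrix element, matching the *8 in (i*N+j)*8


def addresses_indexed(base, n, i):
    """Recompute base + (i*n + j)*8 from scratch for every j."""
    return [base + (i * n + j) * ELEM for j in range(n)]


def addresses_running_pointer(base, n, i):
    """Hoist the row start out of the loop, then bump a pointer by 8."""
    ptr = base + i * n * ELEM  # computed once per row
    out = []
    for _ in range(n):
        out.append(ptr)
        ptr += ELEM  # one add per iteration replaces the multiply chain
    return out


if __name__ == "__main__":
    assert addresses_indexed(4096, 4, 2) == addresses_running_pointer(4096, 4, 2)
    print(addresses_running_pointer(4096, 4, 2))
```

The same idea applies to a column walk, where the per-iteration increment would be `N*8` instead of 8.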
Full Changelog: v2.8.20...v2.8.21