Summary
Perry's GC includes a "block-persistence" pass at crates/perry-runtime/src/gc.rs:1090 that marks ALL objects in an 8 MB arena block live whenever any single object in the block is root-reachable. This was added to protect against untracked arena refs sitting in caller-saved registers that the conservative stack scan can't capture (issues #43 / #44 pattern — dangling-pointer crashes when GC freed still-reachable arena objects).
The conservatism is catastrophic for tight allocation loops that co-locate fresh per-iteration data with pre-existing long-lived state. The JSON.parse case was traced in detail in #149; summary of the mechanism:
- Block 0 (and sometimes block 1) contains long-lived data: interned string keys, shape-cache
keys_arrays, the blob being parsed, any caller-level root arrays.
- The same block also contains early-iteration allocations (the first
new Foo() in the loop always lands adjacent to the setup data).
- Every subsequent iteration's allocations go into new blocks.
- GC marks block 0 live (because of the long-lived data). Block-persist marks EVERY object in block 0 live — including the dead early-iteration objects.
- Those dead objects have field values pointing into later blocks (fresh name strings, tag arrays, nested objects).
- The persist pass iterates to fixed point, following pointers across blocks — at iter 49 of
bench_json_roundtrip, 2 live blocks / 37 truly-reachable objects cascaded to 30 live blocks / 3 M "live" objects over 11 rounds.
gc() becomes a no-op. RSS grows linearly with iteration count.
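The cascade above can be sketched as a toy fixed-point model: any root makes its entire block live, every object in a live block (dead or not) gets its outgoing pointers followed, and the pass repeats until no new block is added. ArenaModel, persist_live_blocks, and the three-block heap below are illustrative stand-ins, not Perry's real data structures.

```rust
use std::collections::HashSet;

/// Toy heap: each object lives in a block and its fields may point at
/// objects in other blocks.
struct ArenaModel {
    block_of: Vec<usize>,       // object index -> block index
    points_to: Vec<Vec<usize>>, // object index -> target object indices
}

/// Block-persist to fixed point: a live block makes every co-located
/// object "live", and their pointers drag further blocks in.
fn persist_live_blocks(heap: &ArenaModel, roots: &[usize]) -> HashSet<usize> {
    let mut live: HashSet<usize> =
        roots.iter().map(|&o| heap.block_of[o]).collect();
    loop {
        let mut grew = false;
        for (obj, targets) in heap.points_to.iter().enumerate() {
            if live.contains(&heap.block_of[obj]) {
                // obj counts as live merely by co-location with a root
                for &t in targets {
                    grew |= live.insert(heap.block_of[t]);
                }
            }
        }
        if !grew {
            return live;
        }
    }
}

fn main() {
    // Block 0: a long-lived root (obj 0) plus a dead iter-0 object (obj 1)
    // whose field points into block 1; block 1's dead object points on to
    // block 2 -- the JSON.parse shape in miniature.
    let heap = ArenaModel {
        block_of: vec![0, 0, 1, 2],
        points_to: vec![vec![], vec![2], vec![3], vec![]],
    };
    let live = persist_live_blocks(&heap, &[0]);
    // One root in block 0 drags blocks 1 and 2 live via dead objects.
    assert_eq!(live.len(), 3);
}
```

A precise mark would keep only block 0 here; block-persist keeps all three, which is the 2-blocks-to-30-blocks cascade in the large.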
Measured impact (bench_json_roundtrip at v0.5.190)
|          | Perry  | Bun    | ratio |
|----------|--------|--------|-------|
| Speed    | 315 ms | 248 ms | 1.27× |
| Peak RSS | 318 MB | 83 MB  | 3.83× |
The speed gap is tolerable (recently closed from 2.4× to 1.27×, see #149 closeout). The RSS gap is the architectural cost of block-persist + bump-arena vs. Bun's per-object GC.
Not just JSON
Any tight new ClassName() / new Array() / parser loop that runs after meaningful setup hits the same pattern. The setup's long-lived roots anchor block 0 live; the first loop iteration allocates adjacent to them; block-persist pulls in the dead iter-0 objects on every GC and cascades from there.
We've seen the contours of this before:
- items.push({...}) builds that produce megabytes of structure but only length is read later
- Buffer.alloc with const buf = Buffer.alloc(1024) in a loop (before the #173 per-thread bump slab fast-path)
- new Map() build loops
Fix directions
Ranked by scope.
A. Segregated long-lived arena region (medium)
Allocate intrinsically long-lived data — PARSE_KEY_CACHE strings, shape-cache keys_arrays, transition-cache arrays, the string intern table, stringify scratch — into a DEDICATED arena block (or a small fixed-size region) that block-persist can conservatively retain. Everything else goes into the general arena where per-iter blocks genuinely go dead together.
Well-bounded scope (no correctness trade-off with #43 / #44), but needs careful routing: every long-lived allocation path has to opt in.
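A minimal sketch of the split, assuming a toy bump allocator (Arena, SplitHeap, and alloc_long_lived are all hypothetical names, not Perry's real API): long-lived sites opt into a quarantine region that block-persist may conservatively retain, while the general region's blocks can be reclaimed wholesale.

```rust
/// Toy bump allocator standing in for one arena region.
struct Arena {
    buf: Vec<u8>,
    used: usize,
}

impl Arena {
    fn with_capacity(cap: usize) -> Self {
        Arena { buf: vec![0; cap], used: 0 }
    }

    /// Bump-allocate n bytes; returns the start offset, or None when full.
    fn alloc(&mut self, n: usize) -> Option<usize> {
        if self.used + n > self.buf.len() {
            return None;
        }
        let off = self.used;
        self.used += n;
        Some(off)
    }
}

/// General region for per-iteration data plus a quarantine region that
/// long-lived sites (intern table, shape-cache arrays, scratch) use.
struct SplitHeap {
    general: Arena,
    long_lived: Arena,
}

impl SplitHeap {
    /// Default path: per-iteration allocations whose blocks die together.
    fn alloc(&mut self, n: usize) -> Option<usize> {
        self.general.alloc(n)
    }

    /// Opt-in path for intrinsically long-lived allocations.
    fn alloc_long_lived(&mut self, n: usize) -> Option<usize> {
        self.long_lived.alloc(n)
    }

    /// After GC, the general region can be reset without block-persist
    /// ever anchoring it to the quarantine data.
    fn reset_general(&mut self) {
        self.general.used = 0;
    }
}

fn main() {
    let mut heap = SplitHeap {
        general: Arena::with_capacity(64),
        long_lived: Arena::with_capacity(64),
    };
    let _key = heap.alloc_long_lived(8); // e.g. an interned key
    let _obj = heap.alloc(16);           // per-iteration object
    heap.reset_general();
    assert_eq!(heap.general.used, 0);    // per-iter data reclaimed
    assert_eq!(heap.long_lived.used, 8); // quarantine data survives
}
```

The point of the design is that co-location, the root cause of the cascade, is decided at the allocation site rather than by arrival order.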
B. Weaken block-persist with a stricter pre-condition (high)
Block-persist exists because the conservative stack scan can miss handles in caller-saved registers during mid-parse GC triggers. If we can guarantee (by convention + scaffolding in js_json_parse, js_buffer_alloc, etc.) that all intermediate arena refs from a given path are tracked in an explicit root set (like PARSE_ROOTS) before any internal allocation, we can skip block-persist for those paths.
Would need a comprehensive audit of allocation call sites. Higher risk — re-opens the failure mode #43 / #44 closed.
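One shape the scaffolding could take, as a sketch: an RAII guard that registers a handle in an explicit root set before any allocation can trigger GC, and releases it on scope exit. PARSE_ROOTS here is a stand-in modeled on the name in the issue; RootGuard and the usize handles are hypothetical.

```rust
use std::cell::RefCell;

thread_local! {
    // Stand-in for the explicit root set: handles the conservative stack
    // scan might otherwise miss in caller-saved registers.
    static PARSE_ROOTS: RefCell<Vec<usize>> = RefCell::new(Vec::new());
}

/// RAII guard: registers a handle on creation and unregisters it on drop,
/// so a mid-parse GC between allocations always sees the ref as a root.
struct RootGuard {
    handle: usize,
}

impl RootGuard {
    fn new(handle: usize) -> Self {
        PARSE_ROOTS.with(|r| r.borrow_mut().push(handle));
        RootGuard { handle }
    }
}

impl Drop for RootGuard {
    fn drop(&mut self) {
        PARSE_ROOTS.with(|r| {
            let mut roots = r.borrow_mut();
            if let Some(pos) = roots.iter().rposition(|&h| h == self.handle) {
                roots.remove(pos);
            }
        });
    }
}

fn roots_len() -> usize {
    PARSE_ROOTS.with(|r| r.borrow().len())
}

fn main() {
    {
        let _g = RootGuard::new(42); // intermediate ref now GC-visible
        assert_eq!(roots_len(), 1);
        // ... allocations that may trigger GC happen here ...
    }
    assert_eq!(roots_len(), 0); // scope exited, root released
}
```

The guard makes the convention hard to forget at call sites, but it cannot prove coverage: any path that allocates before taking a guard silently re-opens the #43 / #44 hole, hence the audit.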
C. Generational GC (largest)
Young generation = throwaway per-GC-cycle region; old generation = current arena model. Most parse output is trivially young-generation (dies before first GC). Matches what Bun/V8 do.
Weeks of work, changes allocator semantics throughout the runtime.
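A toy sketch of the shape, with illustrative names (GenHeap, minor_gc): per-iteration output lands in a young region that is discarded wholesale at each minor collection, and the few survivors are copied into the old region, which keeps the current arena semantics.

```rust
#[derive(Clone, Copy, Debug)]
struct Obj(u64);

/// Two-generation toy heap: young is a throwaway per-cycle region,
/// old keeps the existing arena model.
struct GenHeap {
    young: Vec<Obj>,
    old: Vec<Obj>,
}

impl GenHeap {
    fn new() -> Self {
        GenHeap { young: Vec::new(), old: Vec::new() }
    }

    /// All allocation goes to the young region first.
    fn alloc(&mut self, o: Obj) -> usize {
        self.young.push(o);
        self.young.len() - 1
    }

    /// Minor GC: promote the few survivors to the old generation, then
    /// drop the entire young region at once. Most parse output dies here
    /// without ever being traced individually.
    fn minor_gc(&mut self, survivors: &[usize]) {
        for &i in survivors {
            let promoted = self.young[i];
            self.old.push(promoted);
        }
        self.young.clear();
    }
}

fn main() {
    let mut heap = GenHeap::new();
    for i in 0..1_000u64 {
        heap.alloc(Obj(i));
    }
    heap.minor_gc(&[999]); // only one object outlives the loop
    assert!(heap.young.is_empty());
    assert_eq!(heap.old.len(), 1);
    assert_eq!(heap.old[0].0, 999);
}
```

The real cost is everything this sketch omits: write barriers for old-to-young pointers, survivor identification, and retuning every allocation path, which is why this option is weeks of work.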
D. Do nothing (documented trade-off)
Ship Perry's current numbers as-is. Speed is already a win over Node. RSS gap is the cost of the arena model's simplicity. Document it, close the door.
What I'd do next
(A) looks like the right first step — "long-lived data gets its own quarantine block" is a conceptually clean, well-bounded change. Implementing that for just PARSE_KEY_CACHE + shape-cache arrays would prove the concept on the JSON workload and inform whether to extend.
References
- crates/perry-runtime/src/gc.rs (mark_stack_roots)
- crates/perry-runtime/src/gc.rs:1090 (mark_block_persisting_arena_objects)
- crates/perry-runtime/src/arena.rs:592 (arena_reset_empty_blocks)