Implement perry-container and perry-container-compose modules #2
yumin-chen wants to merge 3 commits into feat/container-compose from
Conversation
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode; when this mode is on, I will only act on comments where you specifically mention me. New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task.
Force-pushed: d0be721 → 093e7a0
Force-pushed: 093e7a0 → bd88aba
Force-pushed: b623b2f → bd88aba
Force-pushed: d3d0b0a → 7396c20
Force-pushed: 4b72520 → 4cda64d
Force-pushed: 247b2b9 → 74af827
Force-pushed: 4204a2b → 4537ed2
Single-constant change (BLOCK_SIZE in arena.rs) that re-tunes the arena for the post-v0.5.193 GC. Codegen's inline bump allocator reads block size from InlineArenaState at runtime, so there are no IR changes, just a different allocation granularity.

Measured on bench_json_roundtrip (best-of-5, macOS ARM64):
- v0.5.193 (8 MB blocks): 384 ms / 213 MB
- v0.5.194 (1 MB blocks): 322 ms / 199 MB (-16% time, -7% RSS)

Perry now beats Node on time (RSS is still slightly higher):
- Node: 372 ms / 191 MB
- Perry: 322 ms / 199 MB (-13% time, +4% RSS)

Still trails Bun (248 ms / 83 MB); the remaining gap is structural (tier 2/3 work per docs/memory-perf-roadmap.md).

The surprise was the TIME win. Smaller blocks mean the arena reaches the GC threshold sooner on the first iteration, so the adaptive step halves earlier, and the 60-80% freed-pct this bench produces actually drives productive reclaim instead of sitting on a too-high step until the workload ends. The RSS win was smaller than projected because the bulk of arena bytes isn't the 5-block recent-safety window (now 5 MB instead of 40 MB); it's the allocation headroom between GCs, which scales with the adaptive step, not block size.

Swept 512 KB, 1 MB, and 2 MB. 1 MB is the sweet spot: RSS essentially tied with 512 KB, with 2× less block-count overhead.

Regression scan clean across 7 benches (object_create, binary_trees, loop_overhead, math_intensive, gc_pressure, array_write, array_grow): all identical to v0.5.193. Gap tests 24/28 unchanged. Runtime tests 124/124.

New docs/memory-perf-roadmap.md captures the strategic plan for beating Bun on both axes:
- Tier 1 (days): #1 block size (this commit), #2 SSO, #3 SIMD JSON
- Tier 2 (weeks): escape analysis, precise root tracking
- Tier 3 (month+): generational GC, compacting GC
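A minimal sketch of the mechanism the commit tunes: a block-based bump arena whose reclaim granularity is set by a single constant. The names here (`Arena`, `Block`, `BLOCK_SIZE`) are illustrative, not Perry's actual arena.rs API; only the 1 MB constant comes from the commit message.

```rust
// Illustrative block-based bump arena. Block size sets the granularity at
// which a GC can retire memory, which is what the 8 MB -> 1 MB retune changes.
const BLOCK_SIZE: usize = 1 << 20; // 1 MB after the retune (was 8 MB)

struct Block {
    data: Vec<u8>,
    used: usize,
}

struct Arena {
    blocks: Vec<Block>,
}

impl Arena {
    fn new() -> Self {
        Arena { blocks: vec![Block { data: vec![0; BLOCK_SIZE], used: 0 }] }
    }

    /// Bump-allocate `len` bytes, opening a new block when the current one
    /// is full. Smaller blocks mean allocation pressure crosses the GC
    /// threshold in finer steps, so the first collection fires sooner.
    fn alloc(&mut self, len: usize) -> &mut [u8] {
        assert!(len <= BLOCK_SIZE);
        let full = {
            let cur = self.blocks.last().unwrap();
            cur.used + len > cur.data.len()
        };
        if full {
            self.blocks.push(Block { data: vec![0; BLOCK_SIZE], used: 0 });
        }
        let cur = self.blocks.last_mut().unwrap();
        let start = cur.used;
        cur.used += len;
        &mut cur.data[start..start + len]
    }

    fn allocated_bytes(&self) -> usize {
        self.blocks.iter().map(|b| b.used).sum()
    }
}
```

With 1 MB blocks, a 5-block recent-safety window pins 5 MB instead of 40 MB, which is the RSS effect the message describes.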
…(v0.5.197) Add SIMD string-terminator scan to json.rs::DirectParser::parse_string_bytes: a 16-byte chunk scan for `"` or `\` with a scalar tail.

Target-gated:
- aarch64 → vdupq_n_u8 / vceqq_u8 / vmaxvq_u8 / vst1q_u8
- x86_64 → _mm_cmpeq_epi8 / _mm_movemask_epi8 / trailing_zeros
- other → scalar

Measured on a long-string synthetic (100+ char strings, 5k records × 30 iters):
- Scalar: 92-102 ms
- NEON: 75-77 ms (-18%)

bench_json_roundtrip is UNCHANGED at 316-322 ms / 199 MB because this bench's strings are all <16 bytes: the SIMD body loop never executes, and every string hits the scalar tail.

Tier 1 #3's projected 2-4× speedup requires the simdjson-style structural scan (finding {}[],:" positions in one sweep), which is a substantial DirectParser rewrite. Deferred per roadmap; SSO (tier 1 #2) is more impactful on short-string workloads because it reduces allocation-path cost. The SIMD infrastructure here still matters for real-world JSON (API responses, logs, prose) where values are typically 20-80 bytes.

No regressions: 7 reference benches identical, gap tests 24/28 unchanged, runtime tests 124/124.
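A portable sketch of the scan's shape, showing why short strings see no benefit: the 16-byte body loop only runs for inputs of at least 16 bytes, and everything shorter falls straight through to the scalar tail. The function name is illustrative; the real code replaces the chunk body with the target-gated NEON/SSE2 intrinsics listed above.

```rust
/// Find the index of the first `"` or `\` in `bytes`: 16-byte chunked body,
/// scalar tail. Portable stand-in for the NEON/SSE2 paths (illustrative).
fn find_string_terminator(bytes: &[u8]) -> Option<usize> {
    let mut i = 0;
    // Chunked body: the real implementation does this comparison with
    // vceqq_u8/vmaxvq_u8 (aarch64) or _mm_cmpeq_epi8/_mm_movemask_epi8 (x86_64).
    while i + 16 <= bytes.len() {
        let chunk = &bytes[i..i + 16];
        if let Some(j) = chunk.iter().position(|&b| b == b'"' || b == b'\\') {
            return Some(i + j);
        }
        i += 16;
    }
    // Scalar tail: strings shorter than 16 bytes land here directly, which
    // is why bench_json_roundtrip (all-short strings) was unchanged.
    bytes[i..]
        .iter()
        .position(|&b| b == b'"' || b == b'\\')
        .map(|j| i + j)
}
```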
…B (v0.5.198) Perry now beats Node on BOTH axes of bench_json_roundtrip. Single-constant change: GC_THRESHOLD_INITIAL_BYTES 128 → 64 MB.

The 128 MB initial threshold was tuned around 07_object_create's 96 MB working set (fits under the threshold → 0 GC cost on a 1M-iter tight hot loop). That tuning was wrong for any workload with sustained allocation pressure: bench_json_roundtrip at 5 MB/iter only hit the 128 MB trigger once per bench run (iter ~15), and after v0.5.193's adaptive step the workload's 92%-freed first cycle got read as "back off" and doubled the step to 256 MB; the bench completes before a second GC fires. Lowering to 64 MB fires the first GC at iter ~12, so the second cycle lands at ~160 MB, which the 50-iter bench does reach.

Tuning sweep (best-of-5, macOS ARM64):
- 128 MB (v0.5.195): 322 ms / 199 MB (speed ✅, RSS ❌ vs Node)
- 96 MB: 353 ms / 178 MB
- 64 MB: 373 ms / 144 MB (wins both axes vs Node)
- 48 MB: 378 ms / 130 MB (time breaks even with Node)

Picked 64 MB. Final Perry vs Node 25.8.0:
- Time: 373 ms vs 385 ms (-3%)
- RSS: 144 MB vs 188 MB (-23%)

Still trails Bun 1.3.12 (250 ms / 83 MB) by ~1.5× on both. Closing that gap requires tier 2/3 architectural work per docs/memory-perf-roadmap.md (escape analysis, precise root tracking, generational GC).

Tier 1 #2 (SSO) explored and deferred: 90 runtime/stdlib/codegen files touch strings, a multi-day invasive change, ~30 MB potential savings. It compounds better after generational GC lands (remaining short-string allocations become young-gen garbage).

Regression scan clean: 07_object_create 0-1 ms / 6.4 MB (working set fits in one 1 MB block, well under the 64 MB threshold), 12_binary_trees same, 02_loop_overhead 12-14 ms, 06_math_intensive 14-15 ms, bench_gc_pressure 17-18 ms, bench_array_grow 12-14 ms. Gap tests 24/28 unchanged. Runtime tests 124/124.
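A hypothetical sketch of the adaptive-step behavior the message describes: a very high freed fraction after a cycle is read as "back off" and the step doubles, which is how a 92%-freed first cycle at 128 MB pushed the next trigger to 256 MB. This is a reconstruction from the commit text, not Perry's inspected controller; names and the 0.9 cutoff are assumptions.

```rust
const MB: u64 = 1 << 20;
// Constant name from the commit message; 64 MB is the post-commit value.
const GC_THRESHOLD_INITIAL_BYTES: u64 = 64 * MB;

struct GcThreshold {
    bytes: u64,
}

impl GcThreshold {
    fn new() -> Self {
        GcThreshold { bytes: GC_THRESHOLD_INITIAL_BYTES }
    }

    /// Called after each GC cycle with the fraction of arena bytes freed.
    /// A near-total reclaim is read as "collections are cheap but early:
    /// back off", doubling the step (92%-freed at 128 MB -> 256 MB).
    fn after_cycle(&mut self, freed_fraction: f64) {
        if freed_fraction > 0.9 {
            self.bytes = self.bytes.saturating_mul(2);
        }
        // The controller's other bands are not described in the commit
        // message, so they are left unchanged in this sketch.
    }
}
```

Under this rule, starting at 64 MB the first cycle fires early enough that a second trigger (~160 MB) still lands inside the 50-iter bench, which is the effect the sweep measures.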
Force-pushed: d014e17 → 69dd7db
Tier 1 #2 per docs/memory-perf-roadmap.md. Small String Optimization lets strings of length 0..5 bytes encode inline in the 48-bit NaN-box payload instead of allocating a StringHeader.

INFRASTRUCTURE-ONLY landing. No creation sites migrated yet; see docs/sso-migration-plan.md for the 6-step roll-out sequence with per-step ship criteria.

Why infrastructure-first: a single-commit flip of DirectParser::parse_string_value to emit SSO immediately regressed 3 test_json_lazy_*.ts tests. The consumer surface for strings in Perry is large: json.rs alone has 20+ `== STRING_TAG` dispatches, and the broader fan-out covers object.rs property-get helpers, string.rs methods, regex.rs, set.rs / map.rs key equality, stdlib HTTP/DB paths, and codegen string-literal emission. Landing the infrastructure without producers is safe (the new tag value is allocated but unused) and unblocks incremental per-site migration.

Added:
- SHORT_STRING_TAG = 0x7FF9_0000_0000_0000 (value.rs)
- JSValue::{try_short_string, short_string_to_buf, short_string_len, short_string_unchecked}
- JSValue::{is_short_string, is_any_string}; legacy is_string() stays strict (heap pointer only) so the existing ~50 callers that follow is_string() with as_string_ptr() don't need to be audited yet
- js_string_new_sso(ptr, len) -> f64 (string.rs): SSO-aware creation, falls back to heap on len > 5
- str_bytes_from_jsvalue(value, &mut scratch) (string.rs): central decoder producing (*const u8, u32) for either form
- js_string_materialize_to_heap(value) (string.rs): compatibility shim for callers that truly need *mut StringHeader

Consumer-side dispatch already wired:
- typeof (builtins.rs) accepts both tags
- js_jsvalue_equals (value.rs): SSO fast path when both operands are SSO (canonical encoding ⇒ same bytes ⇒ same bits), decode via scratch buffers otherwise
- js_jsvalue_compare (value.rs): lexicographic comparison via decoded byte slices
- js_value_length_f64 (value.rs): direct bit extraction for SSO, no heap access
- js_jsvalue_to_string (value.rs): materializes SSO to heap when the caller needs *mut StringHeader
- Three stringify arms in json.rs (stringify_value, stringify_object_inner field dispatch, stringify_array_depth element dispatch); the remaining 15+ arms are Step 1 of the migration plan

6 new unit tests in value::tests (total 130 → 136):
- roundtrip across 0, 1, 2, 3, 4, 5-byte inputs
- rejection of 6+ byte inputs (returns None from try_short_string)
- embedded-NUL handling (length is authoritative, NULs are data)
- tag-band distinctness from POINTER / INT32 / NUMBER / UNDEFINED
- empty-string roundtrip
- byte-order stability (first byte lands in LSB, an invariant for any future SIMD bulk-decoder)

Full regression sweep verifies infrastructure-only is safe:
- All 10 test_json_*.ts match Node byte-for-byte
- Runtime tests 136/136
- Workspace cargo test unaffected

docs/sso-migration-plan.md sequences the roll-out:
- Step 1: stringify consumers (json.rs, ~15 sites)
- Step 2: DirectParser emits SSO
- Step 3: object key storage (object.rs, PARSE_KEY_CACHE, shape cache)
- Step 4: string methods (string.rs)
- Step 5: codegen string literals (compile-time constants)
- Step 6: stdlib HTTP / DB paths
- plus a decision gate after Step 2 to re-evaluate vs jumping to tier 2/3
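A sketch of the SSO encoding itself, working over raw `u64` bits for clarity. The tag constant, the 0..=5 length range, and "first byte in LSB" come from the commit message; the exact placement of the length field (byte 5 of the 48-bit payload) is an assumption, and these free functions stand in for the real JSValue methods.

```rust
// Tag band from the commit message; top 16 bits select short-string values.
const SHORT_STRING_TAG: u64 = 0x7FF9_0000_0000_0000;
const TAG_MASK: u64 = 0xFFFF_0000_0000_0000;

/// Encode a string of 0..=5 bytes inline in the 48-bit payload.
/// Returns None for 6+ bytes, which must go to the heap.
fn try_short_string(s: &[u8]) -> Option<u64> {
    if s.len() > 5 {
        return None;
    }
    // Assumed layout: data bytes in the low 40 bits (first byte in the
    // LSB, per the byte-order-stability test), length in payload byte 5.
    let mut bits = SHORT_STRING_TAG | ((s.len() as u64) << 40);
    for (i, &b) in s.iter().enumerate() {
        bits |= (b as u64) << (8 * i);
    }
    Some(bits)
}

fn is_short_string(bits: u64) -> bool {
    bits & TAG_MASK == SHORT_STRING_TAG
}

/// Decode into a caller-provided scratch buffer; returns the byte length.
/// Length is authoritative, so embedded NULs are ordinary data.
fn short_string_to_buf(bits: u64, buf: &mut [u8; 5]) -> usize {
    let len = ((bits >> 40) & 0x7) as usize;
    for i in 0..len {
        buf[i] = ((bits >> (8 * i)) & 0xFF) as u8;
    }
    len
}
```

Because the encoding is canonical (one bit pattern per string), the equality fast path in js_jsvalue_equals reduces to a plain bit comparison when both operands are SSO.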
Force-pushed: 88c7924 → dcfe610
Implement the perry/container and perry/container-compose modules, including a refactored Rust orchestration engine, OCI backend discovery, security verification, and compiler integration.
PR created automatically by Jules for task 15265381819452015182 started by @yumin-chen