perf: compact bytecode encoding (1.0-1.2x VM speedup, 7.9x on field access) #3457
…de, encoding pass) Define 71-variant OpCode enum with 1-byte opcodes and fixed operands, implement two-pass encoder that transforms Vec<Instruction> into Vec<u8>, and wire encoding into engine loading so all functions get compact bytecode populated at load time. The old dispatch path is unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
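A minimal sketch of the two-pass encoding this commit describes. The four-variant `Instruction` and `OpCode` enums, the little-endian operand layout, and the use of absolute byte offsets for jump targets are assumptions made for illustration; the real enum has 71 variants and its exact layout is not shown in this conversation.

```rust
/// Hypothetical, cut-down instruction set (not the real `Instruction` enum).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Instruction {
    LoadConst(u32),
    Add,
    /// Jump target expressed as an instruction index.
    Jump(u32),
    Return,
}

/// Hypothetical 1-byte opcodes matching the instructions above.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OpCode {
    LoadConst = 0,
    Add = 1,
    Jump = 2,
    Return = 3,
}

/// Bytes occupied by one instruction: 1 opcode byte plus fixed-width operands.
fn encoded_size(ins: &Instruction) -> usize {
    match ins {
        Instruction::LoadConst(_) | Instruction::Jump(_) => 5,
        Instruction::Add | Instruction::Return => 1,
    }
}

pub fn lower_to_compact(instructions: &[Instruction]) -> Vec<u8> {
    // Pass 1: record the byte offset at which each instruction will start.
    let mut offsets = Vec::with_capacity(instructions.len() + 1);
    let mut total = 0usize;
    for ins in instructions {
        offsets.push(total);
        total += encoded_size(ins);
    }
    offsets.push(total); // a jump to "one past the end" stays resolvable

    // Pass 2: emit bytes, rewriting jump targets from instruction
    // indices to byte offsets (assumed absolute and little-endian here).
    let mut code = Vec::with_capacity(total);
    for ins in instructions {
        match *ins {
            Instruction::LoadConst(idx) => {
                code.push(OpCode::LoadConst as u8);
                code.extend_from_slice(&idx.to_le_bytes());
            }
            Instruction::Add => code.push(OpCode::Add as u8),
            Instruction::Jump(target) => {
                code.push(OpCode::Jump as u8);
                let byte_target = offsets[target as usize] as u32;
                code.extend_from_slice(&byte_target.to_le_bytes());
            }
            Instruction::Return => code.push(OpCode::Return as u8),
        }
    }
    // The emitted length must agree with the sizes computed in pass 1.
    debug_assert_eq!(code.len(), total);
    code
}
```

The first pass is needed because a forward jump refers to an instruction whose byte offset is unknown until every earlier instruction has been sized.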
…p_compact) Implement exec_compact() and step_compact() dispatch methods that read opcodes from the compact Vec<u8> stream. Add faulting_pc to BytecodeFrame for correct PC recovery in exception handling and stack traces. Extract shared exec_cmpop() and exec_binop() helpers. All existing tests now execute through the compact dispatch path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e 3) Add display_compact_bytecode() that walks the compact Vec<u8> stream and produces human-readable disassembly with byte offsets, opcode mnemonics, operands, and resolved jump targets. Add Display impl for OpCode mapping each variant to its SCREAMING_SNAKE_CASE mnemonic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
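A rough sketch of what a disassembler of this shape might look like. The four opcodes, their operand widths, and the output format are hypothetical; only the function name `display_compact_bytecode`, the `TryFrom<u8>` decode, the `Display` mnemonics, and the `fmt::Write` sink come from the conversation.

```rust
use std::convert::{TryFrom, TryInto};
use std::fmt::{self, Write};

/// Hypothetical opcode subset; the real enum is much larger.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OpCode {
    LoadConst = 0,
    Add = 1,
    Jump = 2,
    Return = 3,
}

impl TryFrom<u8> for OpCode {
    type Error = u8;
    fn try_from(byte: u8) -> Result<Self, u8> {
        Ok(match byte {
            0 => OpCode::LoadConst,
            1 => OpCode::Add,
            2 => OpCode::Jump,
            3 => OpCode::Return,
            other => return Err(other),
        })
    }
}

impl fmt::Display for OpCode {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // SCREAMING_SNAKE_CASE mnemonic per variant.
        f.write_str(match self {
            OpCode::LoadConst => "LOAD_CONST",
            OpCode::Add => "ADD",
            OpCode::Jump => "JUMP",
            OpCode::Return => "RETURN",
        })
    }
}

/// Checked little-endian read; `None` if the operand runs past the end.
fn read_u32(code: &[u8], at: usize) -> Option<u32> {
    let bytes = code.get(at..at + 4)?;
    Some(u32::from_le_bytes(bytes.try_into().ok()?))
}

/// Walks the byte stream and writes one line per instruction:
/// byte offset, mnemonic, and operand (if any).
pub fn display_compact_bytecode(code: &[u8], out: &mut impl Write) -> fmt::Result {
    let mut pc = 0;
    while pc < code.len() {
        let op = match OpCode::try_from(code[pc]) {
            Ok(op) => op,
            Err(bad) => return writeln!(out, "{pc:04}  <invalid opcode {bad:#04x}>"),
        };
        match op {
            OpCode::LoadConst | OpCode::Jump => {
                let operand = match read_u32(code, pc + 1) {
                    Some(value) => value,
                    None => return writeln!(out, "{pc:04}  {op} <truncated>"),
                };
                let note = if op == OpCode::Jump { " (byte offset)" } else { "" };
                writeln!(out, "{pc:04}  {op} {operand}{note}")?;
                pc += 5;
            }
            OpCode::Add | OpCode::Return => {
                writeln!(out, "{pc:04}  {op}")?;
                pc += 1;
            }
        }
    }
    Ok(())
}
```

A debug tool like this can afford the checked `try_from` decode and bounds-checked reads, since it runs off the hot path.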
The compact dispatch was cloning the entire code buffer (compact.code.clone()) for every operand read due to a borrow checker workaround. Since `function` is &'static Function, the code slice lifetime doesn't conflict with &mut self. Fix: extract `code: &'static [u8]` once at the top of step_compact and pass it directly to read_u32/read_i32/read_i8, eliminating 37 heap allocations per instruction dispatch. Also switch read_u32/read_i32 from 4 individual byte accesses to a single slice operation (one bounds check instead of four). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
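To illustrate the second half of this fix, here is a hedged sketch contrasting four indexed byte reads with a single range index. The function bodies and the little-endian layout are assumptions; only the names `read_u32` and `read_i32` appear in the commit message.

```rust
use std::convert::TryInto;

/// Four indexed reads: each `code[..]` access can carry its own bounds check.
pub fn read_u32_bytewise(code: &[u8], pc: usize) -> u32 {
    u32::from_le_bytes([code[pc], code[pc + 1], code[pc + 2], code[pc + 3]])
}

/// One range index: a single bounds check covers all four bytes.
pub fn read_u32(code: &[u8], pc: usize) -> u32 {
    let bytes: [u8; 4] = code[pc..pc + 4]
        .try_into()
        .expect("a 4-byte range always converts to [u8; 4]");
    u32::from_le_bytes(bytes)
}

/// Signed variant, e.g. for relative operands (an assumption about their use).
pub fn read_i32(code: &[u8], pc: usize) -> i32 {
    let bytes: [u8; 4] = code[pc..pc + 4]
        .try_into()
        .expect("a 4-byte range always converts to [u8; 4]");
    i32::from_le_bytes(bytes)
}
```

Both readers take `code: &[u8]` as a parameter, which matches the commit's approach of borrowing the slice once and passing it down instead of re-fetching (or cloning) it through `self` on every read.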
… path Since the compact bytecode is produced by our own encoder (with sizes verified by debug_assert in lower_to_compact), we know all byte reads are in bounds. Use get_unchecked to eliminate bounds checks on every operand read in step_compact, removing ~4 bounds checks per instruction. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
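A sketch of what an unchecked operand read could look like, assuming little-endian operands. The name `read_u32_unchecked` appears in the PR summary, but this body is illustrative only and the real signature may differ.

```rust
/// Reads a little-endian `u32` at `pc` without a bounds check.
///
/// # Safety
/// The caller must guarantee `pc + 4 <= code.len()`. In the PR this is
/// argued from the encoder being trusted; here it is simply a precondition.
#[inline(always)]
pub unsafe fn read_u32_unchecked(code: &[u8], pc: usize) -> u32 {
    // Still checked in debug builds, so encoder bugs surface in tests.
    debug_assert!(pc + 4 <= code.len());
    // SAFETY: the caller guarantees the 4-byte range is in bounds.
    let bytes: &[u8] = unsafe { code.get_unchecked(pc..pc + 4) };
    // SAFETY: `bytes` is exactly 4 bytes long and `[u8; 4]` has alignment 1.
    let array: [u8; 4] = unsafe { *(bytes.as_ptr() as *const [u8; 4]) };
    u32::from_le_bytes(array)
}
```

The trade-off is that any mismatch between encoder and reader becomes undefined behavior in release builds rather than a panic, so the `debug_assert!` is worth keeping.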
…coding-with-fused-opcodes
…eClosure, JumpTable) Adapt compact bytecode encoding to canary's new instruction set: - Map specialized arithmetic instructions (AddInt, SubFloat, etc.) to generic opcodes - Fix MakeClosure to single operand (capture_count now on stack) - Fix JumpTable from struct to tuple variant with default in JumpTableData - Update step_compact dispatch to match new ensure_pop() API (Value, not Result) - Handle _Pad variant with unreachable!() Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restructured the compact bytecode VM dispatch to match (and exceed) the performance of the baseline instruction-array dispatch.

Key changes:
- **Specialized opcodes**: Added dedicated AddInt/SubInt/CmpIntLt/etc. with unreachable_unchecked fast paths instead of mapping to generic Add/Sub that go through exec_binop's type dispatch.
- **Transmute opcode byte**: Skip the 90-arm TryFrom<u8> validation match on every instruction. The bytecode is produced by our own encoder.
- **Tight inner dispatch loop**: Keep pc/code/function/frame_idx as locals across many instructions. Only break out to re-extract from the frame on actual control-flow changes (Call, Return, Throw). Simple ops (arithmetic, load/store, jumps within a function) never touch the frame's instruction_ptr.
- **Return updates frame_idx**: After self.frames.pop(), set frame_idx to the new top so the outer loop detects the frame change correctly.
- **Dead-code legacy dispatch**: exec_inner unconditionally delegates to exec_compact, freeing icache pressure from the 2300-line step() function that's no longer reachable.

Benchmark results vs baseline (bench-real-world):
- fib iterative 100k: 1.2x faster
- if-else dispatch 1M: 1.2x faster
- guard clauses 1M: 1.2x faster
- collatz 100k: 1.2x faster
- bubble sort 5k: 1.2x faster
- divide guard 1M: 1.2x faster
- nested field access: 7.9x faster (compact bytecode icache win)
- most others: 1.0-1.1x faster

Also adds --only-baml flag to bench-real-world for faster A/B testing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
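The dispatch structure described in this commit can be sketched roughly as follows. Everything here is a simplification under stated assumptions: a hypothetical three-opcode set, a single `Frame` with an `instruction_ptr` field, and an `i64` stack. The real VM has many more opcodes and an outer loop over frames.

```rust
/// Hypothetical opcode subset with contiguous discriminants.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum OpCode {
    PushI8 = 0, // opcode byte + one i8 operand
    AddInt = 1, // opcode byte only
    Return = 2, // opcode byte only
}
const OPCODE_COUNT: u8 = 3;

pub struct Frame {
    pub instruction_ptr: usize,
}

/// Runs until `Return`, keeping `pc` in a local the whole time.
///
/// # Safety
/// Every opcode byte in `code` must be a valid `OpCode` discriminant,
/// i.e. the stream must come from a trusted encoder.
pub unsafe fn exec_compact(
    code: &[u8],
    frame: &mut Frame,
    stack: &mut Vec<i64>,
) -> Option<i64> {
    let mut pc = frame.instruction_ptr; // hot state lives in a local
    loop {
        let byte = code[pc];
        debug_assert!(byte < OPCODE_COUNT);
        // SAFETY: the caller guarantees `byte` is a valid discriminant of
        // the `#[repr(u8)]` enum, so no validation match is needed.
        let op: OpCode = unsafe { std::mem::transmute::<u8, OpCode>(byte) };
        match op {
            // Simple ops only advance the local `pc`; the frame is untouched.
            OpCode::PushI8 => {
                stack.push(code[pc + 1] as i8 as i64);
                pc += 2;
            }
            OpCode::AddInt => {
                let rhs = stack.pop()?;
                let lhs = stack.pop()?;
                stack.push(lhs.wrapping_add(rhs));
                pc += 1;
            }
            // Control flow writes `pc` back to the frame and leaves the loop.
            OpCode::Return => {
                frame.instruction_ptr = pc + 1;
                return stack.pop();
            }
        }
    }
}
```

In this sketch the frame's `instruction_ptr` is written exactly once per call, which is the property the "tight inner dispatch loop" bullet is after.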
📝 Walkthrough
Adds a compact bytecode representation, lowering full instruction bytecode into a compact 1-byte-opcode form at engine construction, plus a disassembler and helpers; also adds a CLI flag to restrict real-world benchmarks to BAML only.
Changes
- Compact Bytecode Encoding & Integration
- Benchmark Script Enhancement
Sequence Diagram
sequenceDiagram
participant Engine as BexEngine::new
participant Bytecode as Bytecode
participant Encoder as lower_to_compact()
participant Compact as CompactCode
participant Debug as display_compact_bytecode()
participant Decoder as OpCode::try_from
Engine->>Bytecode: iterate compile-time Function objects
Bytecode->>Encoder: call lower_to_compact()
Encoder->>Encoder: Pass 1: map instruction -> byte offset
Encoder->>Encoder: Pass 2: emit opcodes, operands, translate jump/table offsets
Encoder->>Compact: produce CompactCode (code, tables, metadata)
Bytecode->>Bytecode: set compact = Some(CompactCode)
Note over Debug,Compact: debug display of compact code
Debug->>Compact: iterate bytes by PC
Debug->>Decoder: read opcode byte -> OpCode
Decoder-->>Debug: OpCode variant
Debug->>Debug: decode operands (read_i8/i32/u32), resolve jump targets, format output
Debug->>Debug: write to fmt::Write buffer
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@baml_language/crates/bex_vm_types/src/bytecode.rs`:
- Around line 694-697: The MAKE_CLOSURE opcode's compact format is inconsistent:
the enum comment says it has two operands (u32 object_idx + u32 capture_count)
but encoded_size(), lower_to_compact(), and the compact disassembler only
handle/emit one u32, causing misalignment; fix by choosing the canonical layout
(two u32s as documented) and update all places to match—adjust encoded_size() to
return the correct byte count for MakeClosure, modify lower_to_compact() to emit
both object_idx and capture_count for MakeClosure, and update the compact
disassembler/parser to read two u32 operands for MakeClosure (also apply the
same consistency fixes to the other occurrences noted around the alternative
ranges). Ensure JumpTable handling remains unchanged and run tests to verify
byte alignment after the first MAKE_CLOSURE.
📒 Files selected for processing (6)
- baml_language/crates/bex_engine/src/lib.rs
- baml_language/crates/bex_vm/src/debug.rs
- baml_language/crates/bex_vm/src/errors.rs
- baml_language/crates/bex_vm/src/vm.rs
- baml_language/crates/bex_vm_types/src/bytecode.rs
- baml_language/scripts/bench-real-world
Merging this PR will improve performance by 34.15%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | WallTime | vm_call_chain_100_x_5k | 36.6 ms | 33.1 ms | +10.68% |
| ⚡ | WallTime | vm_field_access_50k | 9.7 ms | 7.6 ms | +27.24% |
| ⚡ | WallTime | vm_nested_loop | 6.1 ms | 5.4 ms | +14.48% |
| ⚡ | WallTime | vm_array_iter_10k | 5.8 ms | 4.7 ms | +22.2% |
| ⚡ | WallTime | vm_closure_call_50k | 10.7 ms | 9.1 ms | +17.32% |
| ⚡ | WallTime | vm_array_push_50k | 9.7 ms | 8.2 ms | +19.03% |
| ⚡ | WallTime | vm_loop_500k | 45.4 ms | 35.7 ms | +27.15% |
| ⚡ | WallTime | vm_fib_20 | 5.5 ms | 4.7 ms | +16.95% |
| ⚡ | WallTime | vm_class_create_50k | 19.3 ms | 14.4 ms | +34.15% |

Comparing compact-bytecode-encoding-with-fused-opcodes (64b3dbc) with canary (3f888bf)

Binary size checks passed: ✅ 7 passed
The early_yield tests use BexVm::from_program directly (bypassing BexEngine which previously did the lowering). exec_compact would then panic on compact.as_ref().unwrap() with None. Move the lower_to_compact() call into convert_program so every caller path gets compact bytecode regardless of whether they go through BexEngine. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The compact disassembler was reading 2 u32 operands for MakeClosure, but the compact encoding only emits 1 (obj_idx) — capture_count is popped from the stack at runtime (pushed by a preceding LoadConst). This caused the disassembler to consume the next instruction's opcode byte as the phantom "captures" operand, misaligning all subsequent output after the first MAKE_CLOSURE. Also update the misleading comment on the MakeClosure variant declaration to reflect the actual single-operand encoding. Note: encoded_size() (5 bytes) and lower_to_compact() (1 u32 emit) were already correct — only the disassembler diverged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
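The failure mode in this commit (a decoder and encoder disagreeing on an instruction's width) can be shown with a small sketch. The three-opcode set and the `walk` helper are hypothetical; the 5-byte `MakeClosure` size reflects the encoding as described at this commit, before the later change that widened it.

```rust
/// Hypothetical opcode subset for illustration.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum OpCode {
    LoadConst = 0,
    MakeClosure = 1,
    Return = 2,
}

impl OpCode {
    pub fn from_byte(byte: u8) -> Option<OpCode> {
        match byte {
            0 => Some(OpCode::LoadConst),
            1 => Some(OpCode::MakeClosure),
            2 => Some(OpCode::Return),
            _ => None,
        }
    }

    /// Encoded size in bytes. `MakeClosure` carries only a u32 object
    /// index; the capture count is taken from the stack at runtime.
    pub fn encoded_size(self) -> usize {
        match self {
            OpCode::LoadConst | OpCode::MakeClosure => 5,
            OpCode::Return => 1,
        }
    }
}

/// Walks `code` using `size_of` to step between instructions. Returns the
/// (offset, opcode) pairs, or `None` if the walk hits an invalid opcode
/// byte or does not end exactly at the end of the stream.
pub fn walk(
    code: &[u8],
    size_of: impl Fn(OpCode) -> usize,
) -> Option<Vec<(usize, OpCode)>> {
    let mut pc = 0;
    let mut seen = Vec::new();
    while pc < code.len() {
        let op = OpCode::from_byte(code[pc])?;
        seen.push((pc, op));
        pc += size_of(op);
    }
    if pc == code.len() { Some(seen) } else { None }
}
```

Calling `walk(&code, OpCode::encoded_size)` on a well-formed stream yields every instruction boundary, while passing a size function that treats `MakeClosure` as 9 bytes overruns or lands mid-operand. Having the encoder, dispatcher, and disassembler share one size function is one way to keep them from drifting apart.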
…ttypeof

Reconcile reflect.type_of<T>() runtime substrate with canary's compact bytecode encoding (#3457). Canary made bytecode dispatch table-driven via fixed-size compact opcodes; reflect added struct-variant operands carrying ntypeargs/capture_count for generic forwarding. Both paths needed to co-exist: extend the compact format to thread the new fields end-to-end.

- bex_vm_types/bytecode.rs: AllocInstance/Call grow to 7 bytes (+u16 ntypeargs); MakeClosure grows to 9 bytes (+u16 capture_count, +u16 ntypeargs). New OpCode::LoadType (5 bytes, u32 const idx). Updated encoder pass 2, encoded_size, OpCode::try_from, and Display name.
- bex_vm/vm.rs: compact dispatch reads the new operand bytes. Call, AllocInstance, MakeClosure pop ntypeargs Object::Type values into frame.type_args / class_type_args / captured_type_args. New OpCode::LoadType handler substitutes TyTemplate against frame type_args. exec_cmpop gains an Object::Type arm for structural ==. IsType compact path handles ConstValue::ClassWithTypeArgs.
- bex_vm/debug.rs: disassembler prints the new fields.
- baml_compiler2_ast/lower_expr_body.rs: drop reflect's lower_match_pattern in favour of canary's PATTERN-CST lower_pattern family (reflect's only caller already migrated via auto-merge).
- baml_compiler2_mir/lower.rs: alias TypeExpr→AstTypeExpr to match canary's import shape.
- BytecodeFrame: keep type_args (pub) alongside canary's pub(crate) faulting_pc.
- Snapshot regen: cargo insta accept (mostly line-shift in llm_types.baml from canary's stdlib edits, plus new patterns_new and type_reflection_strict snapshots).
Summary
Re-encodes the VM's `Vec<Instruction>` (24-byte enum) into a compact byte stream `Vec<u8>` with 1-byte opcodes and inline operands. Restructures dispatch to keep `pc`/`code` as locals in a tight inner loop, only breaking out for actual control-flow changes.

Result: 1.0–1.2x faster on most benchmarks, 7.9x faster on nested field access (compact code = better icache utilization).

Benchmark results vs baseline

Key changes

Encoding (`bex_vm_types/src/bytecode.rs`)
- `#[repr(u8)] OpCode` enum with ~90 variants (1/2/5/9-byte encoding groups)
- `CompactCode { code: Vec<u8>, jump_tables: Vec<CompactJumpTable> }` produced once per function at engine load time
- Specialized opcodes (`AddInt`, `SubInt`, `CmpIntLt`, etc.) get their own bytes: no type dispatch at runtime

Dispatch (`bex_vm/src/vm.rs`)
- `exec_compact` now wraps `step_compact` in a tight inner `loop {}` that keeps `pc`/`code`/`function`/`frame_idx` as locals
- `step_compact` is `#[inline(always)]` and reads operands via `read_u32_unchecked(code, pc)`: no frame indirection
- Opcode byte decoded via `mem::transmute` (skip 90-arm `TryFrom<u8>` validation)
- Control-flow ops (Call, Return, Throw) write `pc` back to the frame and break the inner loop
- `Return` updates `*frame_idx` after popping so the outer loop detects the frame change

Disassembly (`bex_vm/src/debug.rs`)
- `display_compact_bytecode` walks the byte stream and prints decoded instructions with operand annotations

Engine wiring (`bex_engine/src/lib.rs`)
- All `Function` objects get compact bytecode at load time (`Bytecode::lower_to_compact`)
- Legacy `exec_inner` path kept but unreachable: `exec_inner` unconditionally delegates to `exec_compact`

Tooling
- Added `--only-baml` flag to `scripts/bench-real-world` for faster A/B testing without python/node/bun runs

Test plan
- `cargo test -p baml_tests`: 1,861 tests pass
- `cargo clippy -p bex_vm -p bex_vm_types`: clean
- `cargo build --release --bin baml-cli`: builds
- `./scripts/bench-real-world --baseline <canary-cli>`: all benchmarks pass with correct results, 1.0-1.2x faster

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
- `--only-baml` benchmark option to run BAML-only comparisons
Improvements