
perf: compact bytecode encoding (1.0-1.2x VM speedup, 7.9x on field access)#3457

Merged: hellovai merged 10 commits into canary from compact-bytecode-encoding-with-fused-opcodes on May 4, 2026

@hellovai hellovai commented May 3, 2026

Summary

Re-encodes the VM's Vec<Instruction> (24-byte enum) into a compact byte stream Vec<u8> with 1-byte opcodes and inline operands. Restructures dispatch to keep pc/code as locals in a tight inner loop, only breaking out for actual control-flow changes.

Result: 1.0–1.2x faster on most benchmarks, 7.9x faster on nested field access (compact code = better icache utilization).

Benchmark results vs baseline

| Workload | Speedup |
|---|---|
| fib(32) recursive | 1.0x |
| fib iterative 100k | 1.2x |
| binary tree depth 20 | 1.1x |
| nested loops 500x500 | 1.0x |
| closure apply 1M | 1.1x |
| if-else dispatch 1M | 1.2x |
| class instances 100k | 1.1x |
| array build+sum 100k | 1.0x |
| string split 100k | 1.0x |
| call chain 100x10k | 1.0x |
| nested field access 500k | 7.9x |
| guard clauses 1M | 1.2x |
| collatz 100k | 1.2x |
| bubble sort 5k | 1.2x |
| grid alloc 1000x100 | 1.1x |
| divide guard 1M | 1.2x |

Key changes

Encoding (bex_vm_types/src/bytecode.rs)

  • New #[repr(u8)] OpCode enum with ~90 variants (1/2/5/9-byte encoding groups)
  • CompactCode { code: Vec<u8>, jump_tables: Vec<CompactJumpTable> } produced once per function at engine load time
  • Two-pass encoding: pass 1 builds index→byte-offset map, pass 2 emits bytes with translated jump offsets
  • Specialized opcodes (AddInt, SubInt, CmpIntLt, etc.) get their own bytes — no type dispatch at runtime
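The two-pass scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code in `bytecode.rs`: the three-variant `Instr` enum, the opcode byte values, the little-endian operand layout, and the choice to make jump offsets relative to the start of the jump instruction are all assumptions made for the example.

```rust
/// Hypothetical three-instruction set; the real instruction enum is much larger.
#[derive(Debug, Clone, Copy)]
enum Instr {
    Pop,            // encodes as 1 byte: opcode only
    LoadConst(u32), // encodes as 5 bytes: opcode + u32 operand
    Jump(i32),      // offset counted in instructions, relative to this instruction
}

// Opcode byte values are made up for this sketch.
const OP_POP: u8 = 0;
const OP_LOAD_CONST: u8 = 1;
const OP_JUMP: u8 = 2;

fn encoded_size(i: &Instr) -> usize {
    match i {
        Instr::Pop => 1,
        Instr::LoadConst(_) | Instr::Jump(_) => 5,
    }
}

fn lower(instrs: &[Instr]) -> Vec<u8> {
    // Pass 1: record the byte offset of every instruction, plus one
    // sentinel entry so a jump may target "one past the end".
    let mut offsets = Vec::with_capacity(instrs.len() + 1);
    let mut at = 0usize;
    for i in instrs {
        offsets.push(at);
        at += encoded_size(i);
    }
    offsets.push(at);

    // Pass 2: emit bytes, translating instruction-relative jump offsets
    // into byte-relative ones using the map from pass 1.
    let mut code = Vec::with_capacity(at);
    for (idx, i) in instrs.iter().enumerate() {
        match *i {
            Instr::Pop => code.push(OP_POP),
            Instr::LoadConst(k) => {
                code.push(OP_LOAD_CONST);
                code.extend_from_slice(&k.to_le_bytes());
            }
            Instr::Jump(rel) => {
                let target = (idx as i64 + rel as i64) as usize;
                let byte_rel = offsets[target] as i64 - offsets[idx] as i64;
                code.push(OP_JUMP);
                code.extend_from_slice(&(byte_rel as i32).to_le_bytes());
            }
        }
    }
    code
}
```

Two passes are needed because a forward jump's byte distance depends on the sizes of instructions that have not been emitted yet; computing all offsets first avoids back-patching.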

Dispatch (bex_vm/src/vm.rs)

  • exec_compact now wraps step_compact in a tight inner loop {} that keeps pc/code/function/frame_idx as locals
  • step_compact is #[inline(always)] and reads operands via read_u32_unchecked(code, pc) — no frame indirection
  • Opcode byte read via mem::transmute (skip 90-arm TryFrom<u8> validation)
  • Only Call/CallIndirect/Return/Throw save pc back to the frame and break the inner loop
  • Return updates *frame_idx after popping so the outer loop detects the frame change
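The loop structure described in these bullets might look something like the sketch below. It is a toy VM written for illustration only: the four opcodes, the one-byte operands, the `Frame` struct, and the `run` function are hypothetical and far simpler than `exec_compact`/`step_compact` (no `CallIndirect`, no `Throw`, and safe indexing instead of unchecked reads).

```rust
// Made-up opcode bytes for a toy stack machine.
const PUSH: u8 = 0; // followed by one i8 operand
const ADD: u8 = 1;
const CALL: u8 = 2; // followed by one u8 function index
const RETURN: u8 = 3;

struct Frame {
    func: usize,
    pc: usize, // only written back when control leaves the frame
}

fn run(functions: &[Vec<u8>], entry: usize) -> Vec<i64> {
    let mut stack: Vec<i64> = Vec::new();
    let mut frames = vec![Frame { func: entry, pc: 0 }];

    // Outer loop: runs once per frame change.
    while let Some(frame) = frames.last() {
        let code: &[u8] = &functions[frame.func];
        let mut pc = frame.pc;

        // Inner loop: `pc` and `code` stay in locals; simple ops never
        // touch the frame.
        loop {
            let op = code[pc];
            pc += 1;
            match op {
                PUSH => {
                    stack.push(code[pc] as i8 as i64);
                    pc += 1;
                }
                ADD => {
                    let b = stack.pop().expect("stack underflow");
                    let a = stack.pop().expect("stack underflow");
                    stack.push(a + b);
                }
                CALL => {
                    let callee = code[pc] as usize;
                    pc += 1;
                    // Save the resume point, then let the outer loop pick
                    // up the callee's frame.
                    frames.last_mut().expect("active frame").pc = pc;
                    frames.push(Frame { func: callee, pc: 0 });
                    break;
                }
                RETURN => {
                    frames.pop();
                    break;
                }
                other => panic!("invalid opcode byte {}", other),
            }
        }
    }
    stack
}
```

The point of the split is that arithmetic and loads only read and write locals, so the compiler is free to keep `pc` and the code pointer in registers across many instructions.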

Disassembly (bex_vm/src/debug.rs)

  • display_compact_bytecode walks the byte stream and prints decoded instructions with operand annotations
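As a rough illustration of what walking the byte stream involves, here is a minimal disassembler for a made-up three-opcode encoding. The `Op` enum, its byte values, the mnemonic format, and the `disassemble` and `read_u32` helpers are assumptions for this sketch and do not reproduce `display_compact_bytecode`; it only mirrors the general shape (decode opcode byte, read operands, resolve jump targets, return `Err` on an invalid byte).

```rust
use std::convert::TryFrom;
use std::fmt::{self, Write};

#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(u8)]
enum Op {
    Pop = 0,       // 1 byte
    LoadConst = 1, // 5 bytes: opcode + u32 constant index
    Jump = 2,      // 5 bytes: opcode + i32 byte offset from this opcode
}

impl TryFrom<u8> for Op {
    type Error = u8;
    fn try_from(b: u8) -> Result<Self, u8> {
        match b {
            0 => Ok(Op::Pop),
            1 => Ok(Op::LoadConst),
            2 => Ok(Op::Jump),
            other => Err(other),
        }
    }
}

/// Bounds-checked little-endian u32 read; `None` if the operand is truncated.
fn read_u32(code: &[u8], at: usize) -> Option<u32> {
    let s = code.get(at..at + 4)?;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}

fn disassemble(code: &[u8], out: &mut impl Write) -> fmt::Result {
    let mut pc = 0;
    while pc < code.len() {
        // An unknown opcode byte aborts the listing with an error.
        let op = Op::try_from(code[pc]).map_err(|_| fmt::Error)?;
        match op {
            Op::Pop => {
                writeln!(out, "{:04} POP", pc)?;
                pc += 1;
            }
            Op::LoadConst => {
                let k = read_u32(code, pc + 1).ok_or(fmt::Error)?;
                writeln!(out, "{:04} LOAD_CONST {}", pc, k)?;
                pc += 5;
            }
            Op::Jump => {
                let rel = read_u32(code, pc + 1).ok_or(fmt::Error)? as i32;
                let target = pc as i64 + rel as i64;
                writeln!(out, "{:04} JUMP {:+} (-> {:04})", pc, rel, target)?;
                pc += 5;
            }
        }
    }
    Ok(())
}
```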

Engine wiring (bex_engine/src/lib.rs)

  • All Function objects get compact bytecode at load time (Bytecode::lower_to_compact)
  • Legacy exec_inner path kept but unreachable — exec_inner unconditionally delegates to exec_compact

Tooling

  • Added --only-baml flag to scripts/bench-real-world for faster A/B testing without python/node/bun runs

Test plan

  • cargo test -p baml_tests — 1,861 tests pass
  • cargo clippy -p bex_vm -p bex_vm_types — clean
  • cargo build --release --bin baml-cli — builds
  • ./scripts/bench-real-world --baseline <canary-cli> — all benchmarks pass with correct results, 1.0-1.2x faster

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Compact bytecode encoding for smaller, more efficient function representations
    • Bytecode disassembler for readable inspection of compact bytecode
    • --only-baml benchmark option to run BAML-only comparisons
  • Improvements

    • Better VM error reporting for invalid/unknown opcode bytes

hellovai and others added 8 commits May 1, 2026 12:51
…de, encoding pass)

Define 71-variant OpCode enum with 1-byte opcodes and fixed operands,
implement two-pass encoder that transforms Vec<Instruction> into Vec<u8>,
and wire encoding into engine loading so all functions get compact
bytecode populated at load time. The old dispatch path is unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…p_compact)

Implement exec_compact() and step_compact() dispatch methods that read
opcodes from the compact Vec<u8> stream. Add faulting_pc to BytecodeFrame
for correct PC recovery in exception handling and stack traces. Extract
shared exec_cmpop() and exec_binop() helpers. All existing tests now
execute through the compact dispatch path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e 3)

Add display_compact_bytecode() that walks the compact Vec<u8> stream and
produces human-readable disassembly with byte offsets, opcode mnemonics,
operands, and resolved jump targets. Add Display impl for OpCode mapping
each variant to its SCREAMING_SNAKE_CASE mnemonic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The compact dispatch was cloning the entire code buffer (compact.code.clone())
for every operand read due to a borrow checker workaround. Since `function`
is &'static Function, the code slice lifetime doesn't conflict with &mut self.

Fix: extract `code: &'static [u8]` once at the top of step_compact and
pass it directly to read_u32/read_i32/read_i8, eliminating 37 heap
allocations per instruction dispatch.

Also switch read_u32/read_i32 from 4 individual byte accesses to a single
slice operation (one bounds check instead of four).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
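The difference between the two read strategies mentioned in this commit can be illustrated as below. Both function names and the little-endian layout are assumptions for the sketch; whether the optimizer already merges the four checks in the first form depends on the compiler and surrounding code.

```rust
/// Four separate index expressions: up to four bounds checks.
fn read_u32_bytewise(code: &[u8], pc: usize) -> u32 {
    u32::from_le_bytes([code[pc], code[pc + 1], code[pc + 2], code[pc + 3]])
}

/// One range index: a single bounds check, then a fixed-size copy.
fn read_u32_slice(code: &[u8], pc: usize) -> u32 {
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(&code[pc..pc + 4]);
    u32::from_le_bytes(bytes)
}
```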
… path

Since the compact bytecode is produced by our own encoder (with sizes
verified by debug_assert in lower_to_compact), we know all byte reads
are in bounds. Use get_unchecked to eliminate bounds checks on every
operand read in step_compact, removing ~4 bounds checks per instruction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
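A hedged sketch of an unchecked operand read in the spirit of this commit follows. The PR names a `read_u32_unchecked(code, pc)` helper, but the body here is an assumption: it uses `get_unchecked` with a `debug_assert!` guard and a little-endian layout, and may differ from the actual implementation.

```rust
/// Reads a little-endian u32 at `pc` without a release-mode bounds check.
///
/// # Safety
/// The caller must guarantee `pc + 4 <= code.len()`, for example because
/// the byte stream was produced by a trusted encoder.
#[inline(always)]
unsafe fn read_u32_unchecked(code: &[u8], pc: usize) -> u32 {
    // Still checked in debug builds, so encoder bugs surface in tests.
    debug_assert!(pc + 4 <= code.len(), "operand read out of bounds");
    // SAFETY: the range is in bounds per the function contract, and
    // [u8; 4] has alignment 1, so the pointer cast is valid.
    let p = unsafe { code.get_unchecked(pc..pc + 4) }.as_ptr() as *const [u8; 4];
    u32::from_le_bytes(unsafe { *p })
}
```

The safety argument rests entirely on the encoder being the only producer of the byte stream; bytecode loaded from an untrusted source would need validation before reaching a reader like this.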
…eClosure, JumpTable)

Adapt compact bytecode encoding to canary's new instruction set:
- Map specialized arithmetic instructions (AddInt, SubFloat, etc.) to generic opcodes
- Fix MakeClosure to single operand (capture_count now on stack)
- Fix JumpTable from struct to tuple variant with default in JumpTableData
- Update step_compact dispatch to match new ensure_pop() API (Value, not Result)
- Handle _Pad variant with unreachable!()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restructured the compact bytecode VM dispatch to match (and exceed) the
performance of the baseline instruction-array dispatch. Key changes:

- **Specialized opcodes**: Added dedicated AddInt/SubInt/CmpIntLt/etc.
  with unreachable_unchecked fast paths instead of mapping to generic
  Add/Sub that go through exec_binop's type dispatch.

- **Transmute opcode byte**: Skip the 90-arm TryFrom<u8> validation match
  on every instruction. The bytecode is produced by our own encoder.

- **Tight inner dispatch loop**: Keep pc/code/function/frame_idx as
  locals across many instructions. Only break out to re-extract from
  the frame on actual control-flow changes (Call, Return, Throw).
  Simple ops (arithmetic, load/store, jumps within a function) never
  touch the frame's instruction_ptr.

- **Return updates frame_idx**: After self.frames.pop(), set frame_idx
  to the new top so the outer loop detects the frame change correctly.

- **Dead-code legacy dispatch**: exec_inner unconditionally delegates
  to exec_compact, freeing icache pressure from the 2300-line step()
  function that's no longer reachable.

Benchmark results vs baseline (bench-real-world):
  fib iterative 100k:     1.2x faster
  if-else dispatch 1M:    1.2x faster
  guard clauses 1M:       1.2x faster
  collatz 100k:           1.2x faster
  bubble sort 5k:         1.2x faster
  divide guard 1M:        1.2x faster
  nested field access:    7.9x faster (compact bytecode icache win)
  most others:            1.0-1.1x faster

Also adds --only-baml flag to bench-real-world for faster A/B testing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
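The "transmute opcode byte" item can be illustrated with a small sketch. The four-variant `OpCode` enum and both decode functions below are hypothetical; the technique relies on the enum being `#[repr(u8)]` with contiguous discriminants starting at 0, which is assumed here rather than confirmed for the real enum.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(u8)]
enum OpCode {
    Nop = 0,
    AddInt = 1,
    SubInt = 2,
    Return = 3,
}

const OPCODE_COUNT: u8 = 4;

/// Validating decode: one match arm per variant, `None` for unknown bytes.
fn decode_checked(b: u8) -> Option<OpCode> {
    match b {
        0 => Some(OpCode::Nop),
        1 => Some(OpCode::AddInt),
        2 => Some(OpCode::SubInt),
        3 => Some(OpCode::Return),
        _ => None,
    }
}

/// Trusting decode: reinterprets the byte directly.
///
/// # Safety
/// `b` must be a valid discriminant, i.e. `b < OPCODE_COUNT`. Any other
/// value is undefined behavior.
#[inline(always)]
unsafe fn decode_unchecked(b: u8) -> OpCode {
    debug_assert!(b < OPCODE_COUNT, "invalid opcode byte");
    // SAFETY: OpCode is repr(u8) and `b` is a valid discriminant per contract.
    unsafe { std::mem::transmute::<u8, OpCode>(b) }
}
```

A checked decode remains useful on cold paths such as a disassembler, where reporting an invalid byte matters more than speed.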

coderabbitai Bot commented May 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d6c1f5a6-ea72-452d-84db-7fe2abb837cf

📥 Commits

Reviewing files that changed from the base of the PR and between 1a9b03a and 64b3dbc.

📒 Files selected for processing (3)
  • baml_language/crates/bex_vm/src/debug.rs
  • baml_language/crates/bex_vm/src/vm.rs
  • baml_language/crates/bex_vm_types/src/bytecode.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • baml_language/crates/bex_vm/src/debug.rs
  • baml_language/crates/bex_vm_types/src/bytecode.rs

Walkthrough

Adds a compact bytecode representation, lowering full instruction bytecode into a compact 1-byte-opcode form at engine construction, plus a disassembler and helpers; also adds a CLI flag to restrict real-world benchmarks to BAML only.

Changes

Compact Bytecode Encoding & Integration

| Layer | File(s) | Summary |
|---|---|---|
| Data Shape & Structures | `baml_language/crates/bex_vm_types/src/bytecode.rs` | Adds `OpCode` enum, `CompactCode`, `CompactJumpTable`, and `Bytecode::compact: Option<CompactCode>` field. |
| Operand Readers & Utils | `baml_language/crates/bex_vm_types/src/bytecode.rs` | Adds `read_u32`, `read_i32`, `read_i8` helpers and `OpCode::encoded_size()`/`TryFrom<u8>`/`Display`. |
| Lowering & Encoding Algorithm | `baml_language/crates/bex_vm_types/src/bytecode.rs` | Implements `Bytecode::lower_to_compact()` (two-pass encoder) and `instruction_to_opcode()` to specialize opcodes and translate instruction-relative offsets to byte-relative offsets; rewrites line/exception table PCs and jump table arms. |
| Unit Tests | `baml_language/crates/bex_vm_types/src/bytecode.rs` | `compact_tests` cover specialization, arithmetic expansion, forward/backward jumps, jump tables, and PC translation for line/exception tables. |
| Engine Integration | `baml_language/crates/bex_engine/src/lib.rs` | `BexEngine::new()` lowers all compile-time `Object::Function` bytecode via `func.bytecode.lower_to_compact()` before building/freezing the heap. |
| Disassembler / Debug Display | `baml_language/crates/bex_vm/src/debug.rs` | Adds `pub fn display_compact_bytecode(compact: &CompactCode, constants: &[ConstValue], f: &mut impl std::fmt::Write) -> std::fmt::Result` that decodes bytes to `OpCode`, renders operands and resolved jump targets, and returns `Err` on invalid opcode bytes. |
| Error Variant | `baml_language/crates/bex_vm/src/errors.rs` | Adds `VmInternalError::InvalidOpcode(u8)` for invalid compact opcode bytes. |

Benchmark Script Enhancement

| Layer | File(s) | Summary |
|---|---|---|
| CLI Option & Runtime Detection | `baml_language/scripts/bench-real-world` | Adds `--only-baml` flag; when set, disables Python/Node/Bun detection by setting `has_python`, `has_node`, `has_bun` to `False`, preventing those benchmarks from running. |

Sequence Diagram

sequenceDiagram
    participant Engine as BexEngine::new
    participant Bytecode as Bytecode
    participant Encoder as lower_to_compact()
    participant Compact as CompactCode
    participant Debug as display_compact_bytecode()
    participant Decoder as OpCode::try_from

    Engine->>Bytecode: iterate compile-time Function objects
    Bytecode->>Encoder: call lower_to_compact()
    Encoder->>Encoder: Pass 1: map instruction -> byte offset
    Encoder->>Encoder: Pass 2: emit opcodes, operands, translate jump/table offsets
    Encoder->>Compact: produce CompactCode (code, tables, metadata)
    Bytecode->>Bytecode: set compact = Some(CompactCode)
    
    Note over Debug,Compact: debug display of compact code
    Debug->>Compact: iterate bytes by PC
    Debug->>Decoder: read opcode byte -> OpCode
    Decoder-->>Debug: OpCode variant
    Debug->>Debug: decode operands (read_i8/i32/u32), resolve jump targets, format output
    Debug->>Debug: write to fmt::Write buffer

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Poem

🐰 In bytes I hop, compressed and neat,

Opcodes snug in every beat,
Lowered warm before the heap is sealed,
Jumps and tables now all revealed,
Debug prints sparkle — compact and sweet.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: compact bytecode encoding with specific performance improvements (1.0-1.2x VM speedup, 7.9x on field access). |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@baml_language/crates/bex_vm_types/src/bytecode.rs`:
- Around line 694-697: The MAKE_CLOSURE opcode's compact format is inconsistent:
the enum comment says it has two operands (u32 object_idx + u32 capture_count)
but encoded_size(), lower_to_compact(), and the compact disassembler only
handle/emit one u32, causing misalignment; fix by choosing the canonical layout
(two u32s as documented) and update all places to match—adjust encoded_size() to
return the correct byte count for MakeClosure, modify lower_to_compact() to emit
both object_idx and capture_count for MakeClosure, and update the compact
disassembler/parser to read two u32 operands for MakeClosure (also apply the
same consistency fixes to the other occurrences noted around the alternative
ranges). Ensure JumpTable handling remains unchanged and run tests to verify
byte alignment after the first MAKE_CLOSURE.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c9e4e744-cfc5-43c6-a908-f1d6470128c6

📥 Commits

Reviewing files that changed from the base of the PR and between 3f888bf and 1a9b03a.

📒 Files selected for processing (6)
  • baml_language/crates/bex_engine/src/lib.rs
  • baml_language/crates/bex_vm/src/debug.rs
  • baml_language/crates/bex_vm/src/errors.rs
  • baml_language/crates/bex_vm/src/vm.rs
  • baml_language/crates/bex_vm_types/src/bytecode.rs
  • baml_language/scripts/bench-real-world

Comment thread baml_language/crates/bex_vm_types/src/bytecode.rs

codspeed-hq Bot commented May 3, 2026

Merging this PR will improve performance by 34.15%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 9 improved benchmarks
✅ 10 untouched benchmarks

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| WallTime | vm_call_chain_100_x_5k | 36.6 ms | 33.1 ms | +10.68% |
| WallTime | vm_field_access_50k | 9.7 ms | 7.6 ms | +27.24% |
| WallTime | vm_nested_loop | 6.1 ms | 5.4 ms | +14.48% |
| WallTime | vm_array_iter_10k | 5.8 ms | 4.7 ms | +22.2% |
| WallTime | vm_closure_call_50k | 10.7 ms | 9.1 ms | +17.32% |
| WallTime | vm_array_push_50k | 9.7 ms | 8.2 ms | +19.03% |
| WallTime | vm_loop_500k | 45.4 ms | 35.7 ms | +27.15% |
| WallTime | vm_fib_20 | 5.5 ms | 4.7 ms | +16.95% |
| WallTime | vm_class_create_50k | 19.3 ms | 14.4 ms | +34.15% |

Comparing compact-bytecode-encoding-with-fused-opcodes (64b3dbc) with canary (3f888bf)



github-actions Bot commented May 3, 2026

Binary size checks passed

7 passed

| Artifact | Platform | Gzip | Baseline | Delta | Status |
|---|---|---|---|---|---|
| bridge_cffi | Linux | 6.1 MB | 5.7 MB | +428.8 KB (+7.6%) | OK |
| bridge_cffi-stripped | Linux | 6.1 MB | 5.7 MB | +399.9 KB (+7.0%) | OK |
| bridge_cffi | macOS | 5.0 MB | 4.6 MB | +401.1 KB (+8.7%) | OK |
| bridge_cffi-stripped | macOS | 5.0 MB | 4.7 MB | +334.0 KB (+7.2%) | OK |
| bridge_cffi | Windows | 5.0 MB | 4.6 MB | +410.9 KB (+8.9%) | OK |
| bridge_cffi-stripped | Windows | 5.0 MB | 4.7 MB | +350.5 KB (+7.5%) | OK |
| bridge_wasm | WASM | 3.3 MB | 3.2 MB | +86.8 KB (+2.7%) | OK |

Generated by cargo size-gate · workflow run

The early_yield tests use BexVm::from_program directly (bypassing BexEngine
which previously did the lowering). exec_compact would then panic on
compact.as_ref().unwrap() with None.

Move the lower_to_compact() call into convert_program so every caller path
gets compact bytecode regardless of whether they go through BexEngine.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The compact disassembler was reading 2 u32 operands for MakeClosure, but
the compact encoding only emits 1 (obj_idx) — capture_count is popped
from the stack at runtime (pushed by a preceding LoadConst).

This caused the disassembler to consume the next instruction's opcode
byte as the phantom "captures" operand, misaligning all subsequent
output after the first MAKE_CLOSURE.

Also update the misleading comment on the MakeClosure variant declaration
to reflect the actual single-operand encoding.

Note: encoded_size() (5 bytes) and lower_to_compact() (1 u32 emit) were
already correct — only the disassembler diverged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
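The misalignment described in this commit is the kind of bug a simple walk check can expose. The sketch below is illustrative only: the opcode bytes, the `walk` helper, and the two size tables are hypothetical. It steps through a byte stream using a per-opcode size function and reports whether the walk lands exactly on the end of the code; a size table that disagrees with the encoder (here, 9 bytes instead of 5 for `MAKE_CLOSURE`) fails that check.

```rust
// Made-up opcode bytes for this sketch.
const LOAD_CONST: u8 = 1;   // opcode + u32
const MAKE_CLOSURE: u8 = 2; // opcode + u32 (obj_idx only)
const RETURN: u8 = 3;       // opcode only

/// Walks `code` using `size_of` and returns the start offset of every
/// instruction, or `None` if an opcode is unknown or the walk does not
/// end exactly at `code.len()`.
fn walk(code: &[u8], size_of: fn(u8) -> Option<usize>) -> Option<Vec<usize>> {
    let mut pc = 0;
    let mut starts = Vec::new();
    while pc < code.len() {
        starts.push(pc);
        pc += size_of(code[pc])?;
    }
    if pc == code.len() { Some(starts) } else { None }
}

/// Sizes that match what the (hypothetical) encoder emits.
fn encoder_size(op: u8) -> Option<usize> {
    match op {
        LOAD_CONST | MAKE_CLOSURE => Some(5),
        RETURN => Some(1),
        _ => None,
    }
}

/// A stale table that still expects a second u32 after MAKE_CLOSURE.
fn stale_size(op: u8) -> Option<usize> {
    match op {
        LOAD_CONST => Some(5),
        MAKE_CLOSURE => Some(9),
        RETURN => Some(1),
        _ => None,
    }
}
```

Driving the encoder, the dispatcher, and the disassembler from one shared size function is one way to keep divergences like this from arising, though the PR does not say whether that approach was taken.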
@hellovai hellovai added this pull request to the merge queue May 4, 2026
Merged via the queue into canary with commit c8ab281 May 4, 2026
43 of 44 checks passed
@hellovai hellovai deleted the compact-bytecode-encoding-with-fused-opcodes branch May 4, 2026 01:48
antoniosarosi added a commit that referenced this pull request May 6, 2026
…ttypeof

Reconcile reflect.type_of<T>() runtime substrate with canary's compact
bytecode encoding (#3457). Canary made bytecode dispatch table-driven via
fixed-size compact opcodes; reflect added struct-variant operands carrying
ntypeargs/capture_count for generic forwarding. Both paths needed to
co-exist: extend the compact format to thread the new fields end-to-end.

- bex_vm_types/bytecode.rs: AllocInstance/Call grow to 7 bytes (+u16
  ntypeargs); MakeClosure grows to 9 bytes (+u16 capture_count, +u16
  ntypeargs). New OpCode::LoadType (5 bytes, u32 const idx). Updated
  encoder pass 2, encoded_size, OpCode::try_from, and Display name.
- bex_vm/vm.rs: compact dispatch reads the new operand bytes. Call,
  AllocInstance, MakeClosure pop ntypeargs Object::Type values into
  frame.type_args / class_type_args / captured_type_args. New
  OpCode::LoadType handler substitutes TyTemplate against frame
  type_args. exec_cmpop gains an Object::Type arm for structural ==.
  IsType compact path handles ConstValue::ClassWithTypeArgs.
- bex_vm/debug.rs: disassembler prints the new fields.
- baml_compiler2_ast/lower_expr_body.rs: drop reflect's
  lower_match_pattern in favour of canary's PATTERN-CST lower_pattern
  family (reflect's only caller already migrated via auto-merge).
- baml_compiler2_mir/lower.rs: alias TypeExpr→AstTypeExpr to match
  canary's import shape.
- BytecodeFrame: keep type_args (pub) alongside canary's
  pub(crate) faulting_pc.
- Snapshot regen: cargo insta accept (mostly line-shift in llm_types.baml
  from canary's stdlib edits, plus new patterns_new and
  type_reflection_strict snapshots).
@coderabbitai coderabbitai Bot mentioned this pull request May 14, 2026