Divinci-AI fork: tracks upstream chrishayuk/larql and adds RFC-0001 mechanistic fact-editing (crown/edit/apply-patch/memit), Phase-1 unlearning (with revert-leak fix), Gemma 4 per-layer intermediate-size handling, and the CI isolation harness used by Divinci AI's LarQL service. Open vindex artifacts are published at huggingface.co/Divinci-AI (Gemma 4 E2B, Qwen3-0.6B/8B/35B-MoE, Llama 3.1-8B, Ministral-3B, MedGemma 1.5-4B, GPT-OSS 120B + two 1-bit dissolution controls). Try the interactive viewer to explore them in 3D.
The model IS the database. Query neural network weights like a graph database. No GPU required.
LARQL decompiles transformer models into a queryable format called a vindex (vector index), then provides LQL (Lazarus Query Language) to browse, edit, and recompile the model's knowledge.
larql> USE "gemma3-4b.vindex";
Using: gemma3-4b.vindex (34 layers, 348.2K features, relations: 512 types)
larql> DESCRIBE "France";
France
Edges (L14-27):
capital → Paris 1436.9 L27 (probe)
language → French 35.2 L24 (probe)
continent → Europe 14.4 L25 (probe)
borders → Spain 13.3 L18 (probe)
larql> INSERT INTO EDGES (entity, relation, target)
... VALUES ("John Coyle", "lives-in", "Colchester");
Inserted 1 edge. Feature F8821@L26 allocated.
larql> INFER "The capital of France is" TOP 3;
1. Paris (97.91%)
2. the (0.42%)
3. a (0.31%)
# Build
cargo build --release
# Pull a pre-built vindex from HuggingFace
larql pull hf://chrishayuk/gemma-3-4b-it-vindex
# List what's cached
larql list
# Run it — one-shot or chat
larql run gemma-3-4b-it-vindex "The capital of France is"
larql run gemma-3-4b-it-vindex # drops into chat mode
# Or extract locally — inference-ready at f16 by default
larql extract google/gemma-3-4b-it -o gemma3-4b.vindex
larql run gemma3-4b.vindex "Einstein is known for"
larql extract defaults to --level inference (full local forward
pass) stored at f16. No flags needed for the common case.
Extract tiers and options
# Browse-only — gate KNN + embeddings, no forward pass (~3 GB for 4B)
larql extract google/gemma-3-4b-it -o gemma3-4b.vindex --level browse
# Attention-only — client-side slice for `run --ffn URL` (Act 2 demo)
larql extract google/gemma-3-4b-it -o gemma3-4b.attn.vindex --level attention
# Inference (default) — full local forward pass
larql extract google/gemma-3-4b-it -o gemma3-4b.vindex
larql extract google/gemma-3-4b-it -o gemma3-4b.vindex --level inference
# All — +lm_head +COMPILE extras (largest)
larql extract google/gemma-3-4b-it -o gemma3-4b.vindex --level all
# Q4_K/Q6_K inline (Ollama-compatible, smallest disk footprint)
larql extract google/gemma-3-4b-it -o gemma3-4b.vindex --quant q4k
# Maximum size reduction on Q4K — drop gate_vectors.bin, rebuild from
# interleaved_q4k.bin at load (~1.6 s cost on 4B, ~12 s on 31B)
larql extract google/gemma-3-4b-it -o gemma3-4b.vindex \
--quant q4k --drop-gate-vectors
# Uniform Q4_K on FFN — gate + up + down all Q4_K (default stores
# down as Q6_K). ~30 MB/layer smaller, ~1.5–1.7× faster decode down
# matmul. Adds ~1.5 % softmax drift; top-1 / top-5 preserved.
larql extract google/gemma-4-31b-it -o gemma4-31b.vindex \
--quant q4k --down-q4k
# Opt out of f16 (rarely wanted — doubles file sizes)
larql extract google/gemma-3-4b-it -o gemma3-4b.vindex --f32
# Convert from GGUF instead of extracting from safetensors
larql convert gguf-to-vindex model.gguf -o model.vindex
extract-index is kept as a backwards-compatible alias of extract.
larql serve gemma3-4b.vindex --port 8080
# Extract once, then carve deployment slices with `larql slice`.
# Either --preset or --parts a,b,c works; `--dry-run` previews.
larql extract google/gemma-4-31b-it -o gemma4-31b.vindex --quant q4k
# Client slice (7.4 GB for 31B Q4_K — attn + embed + norms + tokenizer)
larql slice gemma4-31b.vindex --preset client -o gemma4-31b.client.vindex
# Server slice (27 GB — gate + interleaved FFN + down_meta, no attention)
larql slice gemma4-31b.vindex --preset server -o gemma4-31b.server.vindex
# Server (holds the FFN half):
larql serve gemma4-31b.server.vindex --port 8080 --ffn-only
# Client (laptop — runs attention locally, FFN over HTTP):
larql run gemma4-31b.client.vindex --ffn http://server.local:8080 \
"The capital of France is"
Other presets: browse (DESCRIBE/WALK only, no forward pass), router
(MoE router only, ADR-0003), all (full clone). See larql slice --help
for the explicit part list.
3-tier topology (ADR-0008). When laptop RAM matters, split the embedding table out to its own server:
# Attention-only client (no embed, no FFN — ~310 MB on 4B, 10× smaller than `client`)
larql slice gemma3-4b.vindex --preset attn -o gemma3-4b.attn.vindex
# Embed server slice (embed + tokenizer; paired with ADR-0008 embed-server)
larql slice gemma3-4b.vindex --preset embed -o gemma3-4b.embed.vindex
The 3-tier client + embed server + FFN server split unlocks the
"laptop in ~1 GB" version of the dense-remote topology for small
models. Full rationale in
docs/adr/0007-vindex-distribution.md
and docs/adr/0008-embed-server.md.
larql publish combines slice + hf publish and adds HuggingFace
collections: one run uploads six sibling repos and files them into
three nested collections (model / family / library) for discovery.
# One command. Six repos (full + client + attn + embed + server + browse).
# Three collections (model / family / library).
larql publish gemma4-31b.vindex --repo chrishayuk/gemma-4-31b-it-vindex
# Preview without touching HF
larql publish gemma4-31b.vindex --repo chrishayuk/gemma-4-31b-it-vindex --dry-run
Skip-if-unchanged. Each upload compares the local SHA256 against the
remote lfs.oid. Files that already match skip the transfer. Re-publishing
a ~27 GB server slice where nothing changed re-uploads only the manifest —
not 27 GB of weights. Override with --force-upload.
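The skip-if-unchanged check described above can be sketched as follows. This is a minimal illustration rather than the `larql publish` implementation: `sha256_of_file` and `needs_upload` are hypothetical helper names, and the remote `lfs.oid` is assumed to be a hex-encoded SHA-256 string.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large files never sit in RAM


def sha256_of_file(path: Path) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(path: Path, remote_oid: Optional[str], force: bool = False) -> bool:
    """Hypothetical helper: True when the file should be transferred."""
    if force or remote_oid is None:
        return True  # forced, or the remote has no copy yet
    return sha256_of_file(path) != remote_oid.lower()


# Usage: an unchanged file is skipped, an edited file is re-uploaded.
with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "down_weights.bin"
    weights.write_bytes(b"example bytes")
    remote_oid = sha256_of_file(weights)  # stand-in for the remote lfs.oid
    unchanged = needs_upload(weights, remote_oid)
    weights.write_bytes(b"edited bytes")
    changed = needs_upload(weights, remote_oid)
    forced = needs_upload(weights, sha256_of_file(weights), force=True)
```

Hashing in fixed-size chunks keeps memory flat regardless of file size, which matters for multi-gigabyte slices.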
Streaming + progress. Uploads stream the file (no 27 GB-into-RAM pre-read) and report live progress via a per-file bar. An interrupted run picks up on the next invocation: completed files skip via SHA, the interrupted file re-uploads.
Flags: --no-full, --slices client,server, --collections model,family,
--model-title, --family, --library-title, --slice-repo-template,
--force-upload, --dry-run. Requires HF_TOKEN or
~/.huggingface/token.
larql pull mirrors publish on the download side: pick a specific
sibling, pull them all, or pull a whole collection. Each file gets an
indicatif progress bar; hf-hub resumes interrupted downloads from the
.incomplete partial on the next run.
# Plain pull — the full vindex. Shows a hint at the end listing
# any `-client` / `-attn` / `-embed` / `-server` / `-browse` siblings
# that exist on HF.
larql pull chrishayuk/gemma-4-31b-it-vindex
# Pull just the client slice (laptop side of `run --ffn URL`)
larql pull chrishayuk/gemma-4-31b-it-vindex --preset client
# Pull full + every default sibling in one command
larql pull chrishayuk/gemma-4-31b-it-vindex --all-slices
# Pull every dataset in an HF collection — works on the collection URL
# from larql publish or the slug alone.
larql pull --collection chrishayuk/gemma-4-31b-it-larql-vindex-abc123
Bounding server RSS. --ffn-only skips the eager gate warmup at
startup (55 GB → 5.6 GB on 31B Q4_K). For steady-state bounds, layer
each of these on as needed:
# --layers 0-19: hard bound, this shard serves only layers 0-19
# --max-gate-cache-layers 4: LRU cap on the decoded f16 gate heap
# --release-mmap-after-request: madvise(DONTNEED) post-request (Linux strict)
larql serve gemma4-31b.vindex --port 8080 --ffn-only \
  --layers 0-19 \
  --max-gate-cache-layers 4 \
  --release-mmap-after-request
--layers is the reliable hard bound on both Linux and macOS.
--release-mmap-after-request is strict on Linux, advisory on Darwin.
See docs/adr/0005-ffn-service-memory-bounds.md for the measured
ceilings under each combination.
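The `--max-gate-cache-layers` flag is described above as an LRU cap on decoded gate layers. The sketch below shows how such a cap behaves in general. It is not the server's code: `LayerCache` and `fake_decode` are hypothetical names, and a list of floats stands in for a decoded layer.

```python
from collections import OrderedDict
from typing import Callable, List


class LayerCache:
    """LRU cache holding at most `max_layers` decoded layers (hypothetical)."""

    def __init__(self, max_layers: int, decode: Callable[[int], List[float]]):
        if max_layers < 1:
            raise ValueError("max_layers must be at least 1")
        self.max_layers = max_layers
        self.decode = decode
        self._entries: "OrderedDict[int, List[float]]" = OrderedDict()

    def get(self, layer: int) -> List[float]:
        if layer in self._entries:
            self._entries.move_to_end(layer)  # mark as most recently used
            return self._entries[layer]
        decoded = self.decode(layer)  # cache miss: decode on demand
        self._entries[layer] = decoded
        if len(self._entries) > self.max_layers:
            self._entries.popitem(last=False)  # evict least recently used
        return decoded

    def resident_layers(self) -> List[int]:
        """Layers currently held, least recently used first."""
        return list(self._entries)


decode_calls: List[int] = []


def fake_decode(layer: int) -> List[float]:
    """Stand-in for an expensive decode; records each call."""
    decode_calls.append(layer)
    return [float(layer)] * 4


cache = LayerCache(max_layers=4, decode=fake_decode)
for layer in [0, 1, 2, 3, 0, 4]:
    cache.get(layer)
```

After this access pattern, layer 1 is the one evicted: layer 0 was touched again before layer 4 arrived, so it stays resident.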
larql repl
larql lql 'USE "gemma3-4b.vindex"; DESCRIBE "France";'
larql lql 'USE "hf://chrishayuk/gemma-3-4b-it-vindex"; DESCRIBE "France";'
All under larql dev <subcmd> (weight extraction, QK rank analysis,
OV→gate projection, circuit discovery, trajectory tracing, 20+ others):
larql dev --help
larql dev walk --prompt "The capital of France is" --index gemma3-4b.vindex --predict
Legacy invocation larql walk … still works and transparently trampolines
to larql dev walk ….
A vindex is a directory containing a model's weights reorganised for queryability. Gate vectors become a KNN index. Embeddings become token lookups. Down projections become edge labels. The model IS the database.
gemma3-4b.vindex/
gate_vectors.bin # W_gate rows (KNN index, 3.3 GB)
embeddings.bin # W_embed matrix (token lookup, 2.5 GB)
down_meta.bin # Per-feature output metadata (binary)
index.json # Config, layer bands, provenance
tokenizer.json # Tokenizer
relation_clusters.json # Discovered relation types
feature_labels.json # Probe-confirmed labels
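To make "gate vectors become a KNN index" concrete, here is a minimal pure-Python sketch of a top-k lookup over gate rows. It is an illustration under assumptions: `gate_knn` here is a hypothetical standalone function (not the Rust method of the same name), the rows are toy 3-dimensional vectors, and ranking by raw dot product is assumed.

```python
import heapq
from typing import List, Sequence, Tuple


def gate_knn(
    gate_rows: Sequence[Sequence[float]],
    query: Sequence[float],
    k: int,
) -> List[Tuple[int, float]]:
    """Return the k (feature_index, score) pairs with the largest dot product."""
    scores = (
        (index, sum(w * x for w, x in zip(row, query)))
        for index, row in enumerate(gate_rows)
    )
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])


# Toy "layer" with four features; each row plays the role of one W_gate row.
gate_rows = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
    [-1.0, 0.0, 0.0],
]
query = [2.0, 1.0, 0.0]  # stand-in for a residual / embedding vector
hits = gate_knn(gate_rows, query, k=2)
```

At real dimensions the same scores come from a single matrix-vector product, which is why a BLAS matmul can serve as the index.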
Three extraction levels:
| Level | CLI Flag | LQL Syntax | Size (f16) | Enables |
|---|---|---|---|---|
| Browse | --level browse | EXTRACT MODEL ... INTO ... | ~3 GB | DESCRIBE, WALK, SELECT |
| Inference | --level inference (default) | ... WITH INFERENCE | ~6 GB | + INFER |
| All | --level all | ... WITH ALL | ~10 GB | + COMPILE |
Vindexes are stored at f16 by default, with negligible accuracy loss. Pass --f32 to opt out (rarely wanted, doubles file sizes).
Two crate families. LARQL-specific crates own the vindex + LQL + server stack;
portable model-* crates carry primitives that any neural-model compiler
(LARQL, TinyModel, others) can consume.
# LARQL-specific
larql-models Model config, architecture traits, weight loading, quant/dequant
↓
larql-vindex Vindex lifecycle: extract, load, query, mutate, patch, save
↓
larql-core Graph algorithms, merge, diff
larql-inference Forward pass, BLAS-fused attention, Metal GPU, WalkFfn
↓
larql-lql LQL parser, executor, REPL, USE REMOTE client
↓
larql-server HTTP/gRPC server: serve vindexes over the network
larql-cli CLI commands (extract-index, build, serve, repl, convert, hf, verify)
# Portable (no LARQL deps; extract to sibling repo later)
model-compute bounded compute: native kernels (default) + wasmtime (opt-in)
The portable crate never imports larql-*. Flow is one-way: LARQL consumes
it (e.g. compile-time resolution of sum(1..100) via model_compute::native).
See crates/model-compute/README.md.
Owns the vindex lifecycle. Streaming extraction (mmap, no full model load), KNN via BLAS matmul, zero-copy mmap loading, split weight files, readonly base with patch overlay, clustering, f16 storage.
// Load (readonly base)
let index = VectorIndex::load_vindex(&path, &mut cb)?;
let patched = PatchedVindex::new(index);
// Query
let hits = patched.gate_knn(layer, &query, 10); // 0.008ms/layer
let trace = patched.walk(&query, &layers, 10); // multi-layer scan
// Mutate (patch overlay — base files never modified)
patched.insert_feature(layer, feature, gate_vec, meta);
patched.apply_patch(VindexPatch::load("edits.vlp")?);
LQL parser and executor. 20+ statement types across 6 categories:
- Lifecycle: EXTRACT, COMPILE, DIFF, USE
- Browse: WALK, DESCRIBE, SELECT, EXPLAIN WALK
- Inference: INFER, EXPLAIN INFER
- Mutation: INSERT, DELETE, UPDATE, MERGE
- Patches: BEGIN PATCH, SAVE PATCH, APPLY PATCH, SHOW PATCHES, REMOVE PATCH
- Introspection: SHOW RELATIONS/LAYERS/FEATURES/MODELS/PATCHES, STATS
See docs/specs/lql-spec.md for the full language specification and docs/lql-guide.md for a quick start guide.
-- Decompile a model
EXTRACT MODEL "google/gemma-3-4b-it" INTO "gemma3-4b.vindex" WITH ALL;
-- Browse knowledge (no GPU needed)
USE "gemma3-4b.vindex";
DESCRIBE "France"; -- verbose by default: [relation] labels, also-tokens
DESCRIBE "Einstein" ALL LAYERS;
DESCRIBE "France" BRIEF; -- compact view
WALK "The capital of France is" TOP 10;
-- Run inference (needs model weights in vindex)
INFER "The capital of France is" TOP 5 COMPARE;
-- Trace the residual stream (decomposed forward pass)
TRACE "The capital of France is" FOR "Paris";
TRACE "The capital of France is" DECOMPOSE LAYERS 22-27;
TRACE "The capital of France is" SAVE "france.trace";
-- Edit knowledge (auto-patch: base files never modified)
INSERT INTO EDGES (entity, relation, target)
VALUES ("John Coyle", "lives-in", "Colchester");
-- "Auto-patch started (use SAVE PATCH to persist)"
-- Insert with all knobs (multi-layer constellation, validated regime)
INSERT INTO EDGES (entity, relation, target)
VALUES ("Atlantis", "capital-of", "Poseidon")
AT LAYER 24
CONFIDENCE 0.95
ALPHA 0.30;
-- Patches (lightweight, shareable knowledge diffs)
BEGIN PATCH "medical.vlp";
INSERT INTO EDGES (entity, relation, target)
VALUES ("aspirin", "treats", "headache");
SAVE PATCH;
APPLY PATCH "medical.vlp";
-- Bake the patches into a fresh standalone vindex (instant on APFS:
-- weight files are hardlinked from source, only down_weights.bin gets
-- the override columns rewritten in place).
COMPILE CURRENT INTO VINDEX "gemma3-4b-medical.vindex";
-- Or recompile back to standard HuggingFace / GGUF format. The
-- constellation is in the standard down_proj tensors, so loading in
-- Transformers or GGUF runtimes Just Works — no special loader code.
COMPILE CURRENT INTO MODEL "edited/" FORMAT safetensors;
Patches are lightweight JSON files (.vlp) that capture INSERT/DELETE/UPDATE operations. They overlay an immutable base vindex without modifying it.
-- Create a patch
BEGIN PATCH "medical-knowledge.vlp";
INSERT INTO EDGES (entity, relation, target)
VALUES ("aspirin", "side_effect", "bleeding");
SAVE PATCH;
-- Apply patches (stackable, reversible)
APPLY PATCH "medical-knowledge.vlp";
APPLY PATCH "fix-hallucinations.vlp";
SHOW PATCHES;
REMOVE PATCH "fix-hallucinations.vlp";
-- Extract diff between two vindexes as a patch
DIFF "base.vindex" "edited.vindex" INTO PATCH "changes.vlp";
A single fact is ~10 KB. A 1,000-fact domain patch is ~10 MB. Compared to the full model at 8 GB, that's 1/800th the size. No fine-tuning, no GPU, no retraining.
The base vindex is always readonly. INSERT/DELETE/UPDATE automatically create a patch overlay. Edits are never written to base files.
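The overlay behaviour (readonly base, stackable and removable patches) can be sketched with plain dictionaries. This is a conceptual model only: `PatchedStore` is a hypothetical class, edges are reduced to `(entity, relation) -> target` pairs, and nothing here reflects the actual .vlp file layout.

```python
from types import MappingProxyType
from typing import Dict, List, Mapping, Optional, Tuple

Edge = Tuple[str, str]  # (entity, relation)
_DELETED = object()  # tombstone marking an edge removed by a patch


class PatchedStore:
    """Read-only base plus an ordered stack of named patch overlays."""

    def __init__(self, base: Mapping[Edge, str]):
        # Private copy behind a read-only view: the base cannot be mutated.
        self._base = MappingProxyType(dict(base))
        self._patches: List[Tuple[str, Dict[Edge, object]]] = []

    def apply_patch(self, name: str, ops: Dict[Edge, Optional[str]]) -> None:
        """Stack a patch; a target of None deletes the edge."""
        layer = {
            edge: (_DELETED if target is None else target)
            for edge, target in ops.items()
        }
        self._patches.append((name, layer))

    def remove_patch(self, name: str) -> None:
        """Drop a patch by name; the base and other patches are untouched."""
        self._patches = [(n, layer) for n, layer in self._patches if n != name]

    def lookup(self, edge: Edge) -> Optional[str]:
        for _, layer in reversed(self._patches):  # newest patch wins
            if edge in layer:
                value = layer[edge]
                return None if value is _DELETED else value
        return self._base.get(edge)


store = PatchedStore({("France", "capital"): "Paris"})
store.apply_patch("medical-knowledge.vlp", {("aspirin", "side_effect"): "bleeding"})
store.apply_patch("fix-hallucinations.vlp", {("France", "capital"): None})
before_removal = store.lookup(("France", "capital"))  # hidden by the tombstone
store.remove_patch("fix-hallucinations.vlp")
after_removal = store.lookup(("France", "capital"))  # base value visible again
```

Because every edit lives in an overlay, removing a patch restores the earlier answer without touching the base.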
Declarative model builds. Like a Dockerfile for model knowledge.
# Vindexfile
FROM hf://chrishayuk/gemma-3-4b-it-vindex
PATCH hf://medical-ai/drug-interactions@2.1.0
PATCH ./patches/company-facts.vlp
INSERT ("Acme Corp", "headquarters", "London")
LABELS hf://chrishayuk/gemma-3-4b-it-labels@latest
EXPOSE browse inference
larql build . # build from Vindexfile
larql build . --stage prod # named stage
larql build . --output custom.vindex # custom output path
Input formats: safetensors (HuggingFace), GGUF (llama.cpp, dequantized to f32), MLX (Apple, same safetensors layout).
| Family | Models | FFN Type |
|---|---|---|
| Gemma | Gemma 2/3/4 (2B-31B) | Gated (GeGLU) |
| Llama | Llama 2/3 (7B-405B) | Gated (SiLU) |
| Mistral | Mistral 7B | Gated (SiLU) |
| Mixtral | Mixtral 8x7B, 8x22B | MoE (8 experts) |
| Qwen | Qwen 2/2.5 (0.5B-72B) | Gated (SiLU) |
| Phi | Phi 2/3 (2.7B-14B) | Gated |
| DeepSeek | DeepSeek V2/V3 | MoE (shared + routed) |
| GPT-OSS | GPT-OSS-120B | MoE (128 experts, MXFP4) |
| GPT-2 | GPT-2 (117M-1.5B) | Dense (GELU) |
Dense and full-precision MoE models support all operations (DESCRIBE, WALK, INFER). MXFP4-quantized MoE models (GPT-OSS) can be extracted and served but DESCRIBE/WALK produce noisy results due to 4-bit weight precision — use INFER for accurate knowledge queries. See operations spec for details.
| Operation | Latency |
|---|---|
| Gate KNN (per layer) | 0.008ms |
| Walk (34 layers) | 0.3ms |
| Feature lookup | <1ns |
| Save gates (8 MB) | 1.1ms |
| Load vindex | 8ms |
| Mutate (meta + gate) | 617ns |
| Operation | Latency | tok/s |
|---|---|---|
| GPU Q4K decode (Metal, 34L, KV cache) | 15.6ms | 64 |
| Walk prediction (CPU, no attention) | 33ms | 30 |
| INFER walk (CPU, with attention, mmap FFN) | 517ms | 1.9 |
| INFER dense (CPU, all matmul) | 535ms | 1.9 |
| DESCRIBE (knowledge browse) | 33ms | — |
GPU decode per-stage breakdown:
| Component | Time | % of total |
|---|---|---|
| GPU forward (34 layers, Q4K/Q6K) | 14.1ms | 86% |
| LM head (Q4_0 synthesized from f16 embeddings) | 2.0ms | 12% |
| Embed + norm + detokenize | <0.1ms | <1% |
CPU walk breakdown:
| Component | Time | % of total |
|---|---|---|
| Logits (262K vocab gemv) | 221ms | 41% |
| FFN × 34 layers (walk) | 194ms | 36% |
| Attention × 34 layers | 84ms | 16% |
Walk is faster than dense (517ms vs 535ms). GPU Q4K decode is about 33× faster than CPU walk (15.6ms vs 517ms per token). FFN down projection in walk reads from the mmap'd vindex (zero-copy BLAS). Walk only needs ~3.5GB of model weights (attention + embeddings) rather than 16.6GB, and requires no quantization. See docs/ffn-graph-layer.md for architecture and docs/inference-engine.md for engine details.
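The tok/s column in the inference table follows from the per-token latency. A small worked check, assuming each latency is the time to produce one token:

```python
# Per-token latencies in milliseconds, copied from the inference table above.
latencies_ms = {
    "GPU Q4K decode": 15.6,
    "Walk prediction": 33.0,
    "INFER walk": 517.0,
    "INFER dense": 535.0,
}


def tokens_per_second(latency_ms: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / latency_ms


throughput = {
    name: round(tokens_per_second(ms), 1) for name, ms in latencies_ms.items()
}
```

Rounded to the table's precision, these reproduce the listed 64, 30, 1.9, and 1.9 tok/s.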
Capture the complete record of inference — every layer, every contribution, queryable.
-- LQL: answer trajectory through all layers
larql> TRACE "The capital of France is" FOR "Paris";
Layer Rank Prob Attn FFN Who
L22 50 0.002 +22.2 +34.4 BOTH ↑
L23 10 0.024 -16.9 +55.9 FFN ↑
L24 1 0.714 +105.7 +24.4 BOTH ↑ ← phase transition
L25 1 0.997 +4.3 +94.4 FFN ↑
L26 1 0.999 +83.1 +18.7 BOTH ↑
-- Attn vs FFN decomposition at the phase transition
larql> TRACE "The capital of France is" DECOMPOSE LAYERS 22-27;
-- Persist for later analysis
larql> TRACE "The capital of France is" SAVE "france.trace";
# Python: same trace, programmatic access
import larql
wm = larql.WalkModel("gemma3-4b.vindex")
t = wm.trace("The capital of France is")
t.answer_trajectory("Paris") # rank, prob, attn/ffn logits per layer
t.top_k(24) # [('Paris', 0.714), ...]
t.save("trace.bin") # mmap'd store
| Storage | Per window | 370K tokens | vs KV cache |
|---|---|---|---|
| Boundary residual | 10 KB | 18.9 MB | 3,100x |
| Tier 4 int8 (bit-perfect) | 58 KB | 110 MB | 511x |
| KV cache | ~30 MB | 56,000 MB | 1x |
from larql._native import BoundaryWriter, BoundaryStore
# Write boundary residuals — one per 200-token window
writer = BoundaryWriter("context.bndx", hidden_size=2560, window_size=200)
writer.append(token_offset=0, window_tokens=200, residual=boundary_vec)
writer.finish()
# Mmap'd read — OS pages on demand, RSS ≈ one boundary
store = BoundaryStore("context.bndx")
store.residual(42) # zero-copy from mmap
See docs/residual-trace.md for the full writeup.
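The boundary-residual row of the storage table can be reproduced from the parameters in the BoundaryWriter example above. One assumption is needed: each residual value is stored as a 4-byte float (f32), which is what makes a 2560-wide vector come to about 10 KB.

```python
HIDDEN_SIZE = 2560        # hidden_size from the BoundaryWriter example
WINDOW_TOKENS = 200       # one boundary residual per 200-token window
BYTES_PER_VALUE = 4       # assumption: residuals are stored as f32
CONTEXT_TOKENS = 370_000  # context length used in the storage table

bytes_per_window = HIDDEN_SIZE * BYTES_PER_VALUE  # one residual vector
windows = CONTEXT_TOKENS // WINDOW_TOKENS         # number of boundaries
total_bytes = windows * bytes_per_window
total_mb = total_bytes / 1_000_000                # decimal megabytes
```

This gives 10,240 bytes per window and 18,944,000 bytes across 1,850 windows, matching the table's 10 KB and 18.9 MB.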
| Doc | Description |
|---|---|
| docs/specs/lql-spec.md | LQL language specification (v0.3) |
| docs/specs/vindex-format-spec.md | Vindex file format specification (v0.3, ~98% implemented) |
| docs/specs/vindex-operations-spec.md | Vindex operations, API, patches (~98% implemented) |
| docs/specs/vindex-ecosystem-spec.md | Distributed hosting, HuggingFace, Vindexfile (~85% implemented) |
| docs/lql-guide.md | LQL quick start guide |
| docs/cli.md | CLI reference |
| docs/inference-engine.md | Inference engine — BLAS-fused attention, Metal GPU, auto-calibration |
| docs/ffn-graph-layer.md | FFN graph layer — mmap walk faster than dense (517ms vs 535ms), all 34 layers |
| docs/walk-boundary-sweep.md | Walk boundary sweep — correctness proof across all layer boundaries |
| docs/residual-trace.md | Residual stream trace — decomposition, storage, tiered context |
| docs/specs/trace-format-spec.md | Trace file format specification (.bin, .bndx, .ctxt) |
Requires OpenBLAS on Linux.
cargo build --release # optimised build
cargo build --release --features metal # with Metal GPU backend
cargo test # all tests across all crates
cargo test -p larql-inference # inference engine tests (109 tests)
cargo test -p larql-inference --features metal # + Metal GPU tests (115 tests)
cargo test -p larql-lql # LQL parser + executor tests (272 tests)
cargo test -p larql-vindex # vindex storage + patch tests (104 tests)
# Inference engine examples
cargo run --release -p larql-inference --example attention_demo # fused attention demo
cargo run --release -p larql-inference --example bench_attention # attention benchmarks
cargo run --release -p larql-inference --example backend_demo --features metal # backend demo
cargo run --release -p larql-inference --example bench_backend --features metal # backend benchmarks
cargo run --release -p larql-inference --example bench_inference # full inference benchmarks
# Vindex tools (build once, enables mmap walk)
cargo run --release -p larql-vindex --example convert_gates_f32 -- path/to/vindex # f16→f32 gate vectors
cargo run --release -p larql-vindex --example build_down_features -- path/to/vindex # feature-major down vectors
cargo run --release -p larql-vindex --example build_up_features -- path/to/vindex # feature-major up vectors
# Server (walk inference over HTTP)
cargo run --release -p larql-server -- path/to/vindex --port 8080
# Vindex and LQL demos (synthetic — run in CI)
cargo run -p larql-vindex --example demo_features # vindex feature showcase
cargo run --release -p larql-vindex --example mmap_demo # mmap RAM behaviour + scaling table
cargo run --release -p larql-vindex --example q4k_demo # streaming Q4_K: size ratio, manifests, dequant round-trip
cargo run --release -p larql-vindex --example demo_memit_solve # MEMIT decomposition + MemitStore round-trip
cargo run -p larql-lql --example parser_demo # parser demo (24/24 statements)
cargo run -p larql-lql --example lql_demo # LQL spec compliance (61/61)
cargo run --release -p larql-lql --example compact_demo # LSM storage tier walkthrough
# Model-dependent demos (require real vindex, skip gracefully otherwise)
cargo run --release -p larql-lql --example compile_demo # end-to-end COMPILE INTO VINDEX on real Gemma 4B
cargo run --release -p larql-lql --example refine_demo # 10-fact INSERT + COMPILE (exp 14 reproduction, 10/10 retrieval)
cargo run --release -p larql-lql --example trace_demo # TRACE residual decomposition on real Gemma 4B
# Criterion benches (use --quick for a fast sweep, omit for full sample sizes)
cargo bench -p larql-lql --bench parser # parse_single × 18 + parse_batch
cargo bench -p larql-lql --bench executor # SELECT, SHOW, DELETE, UPDATE, patch lifecycle
cargo bench -p larql-lql --bench compile # COMPILE INTO VINDEX bake cost
cargo bench -p larql-vindex --bench vindex_ops # KNN, walk, save/load, mutate, MoE
cargo bench -p larql-vindex --bench vindex_scaling # production-dim KNN (Gemma/Llama/Mixtral)
cargo bench -p larql-vindex --bench memit_solve # ridge decomposition throughput
cargo bench -p larql-vindex --bench extract_throughput # streaming extract: f32 vs Q4K write-path
cargo bench -p larql-vindex --bench q4k_vs_f32 # per-layer attn retrieval: f32 memcpy vs Q4K dequant
cargo bench -p larql-compute --bench matmul # CPU/Metal matmul backends
The compile_demo example proves the full flow on a real Gemma 4B
vindex: INSERT Atlantis → Poseidon, COMPILE CURRENT INTO VINDEX,
then USE the compiled vindex in a fresh session and verify
INFER "The capital of Atlantis is" → Pose 56.91% and
INFER "The capital of France is" → Paris 67.34% (neighbour
preserved). The constellation is baked into down_weights.bin
column-wise — no overlay or sidecar needed at load time.
Bench HTML reports go to target/criterion/. The parser bench
parses 100 mixed statements in ~78 µs (1.28 M stmts/s); vindex_ops
runs production-sized Gemma 4B gate KNN in ~2.78 ms/layer; compile
runs COMPILE INTO VINDEX in ~1.84 ms (no patches) to 2.41 ms (with
down_weights.bin).
Apache-2.0