Browser-resident Gemma 4 inference in pure Rust → WebAssembly + WebGPU. Loads the same GGUF blobs Ollama already has on disk, runs the forward pass on your local GPU through hand-written WGSL, never touches a remote server.
The intent is a PWA-pluggable inference engine, not a port of Ollama-the-server. Ollama has 275K LOC of Go that wraps llama.cpp via CGO plus model registry, CLI, conversion tooling, multimodal pipelines — almost none of which apply to a browser library. What survives the scope cut is the core inference path over Ollama's storage format.
A two-crate Cargo workspace:
| Crate | Path | Target | Status |
|---|---|---|---|
| rullama | crates/rullama | wasm + native | release-track |
| rullama-finetune | crates/rullama-finetune | native only | experimental (M3 skeleton — see below) |
The iOS bench harness (tools/ios-bench) is a sibling crate excluded from the
workspace so cargo build --workspace --target wasm32-unknown-unknown doesn't
try to compile its staticlib for wasm.
- ✅ `gemma4:e2b` text inference on the desktop loads end-to-end and generates greedy output bit-identical to Ollama. (`gemma4:e4b` is shape-compatible — pull and try it.)
- ✅ `gemma4:e2b` text inference on iPhone — full Q4_K_M model loaded onto an iPhone 16e (A18, 8 GB shared RAM), streaming tokens at ~4.65 tok/s via a Dedicated Worker + sync OPFS path. Multimodal towers stay Mac-only for now; mobile picks the text-only loader (max_context=512).
- ✅ Vision + audio multimodal on the desktop. ViT (16 blocks, 768 hidden) + Conformer (12 blocks, 1024 hidden) towers run on the same wgpu device as the text path; soft tokens splice into the prompt via `<|image>`/`<|audio>` sentinels. Validated bit-identical to Ollama on a fixed image and a 30-second pangram WAV.
- ✅ Q4_K + Q6_K + F16 + F32 quants (the actual mix in `gemma4:e2b` Q4_K_M).
- ✅ Streaming load via HTTP byte-range requests or OPFS sync access handles — the 7 GB GGUF never enters wasm linear memory in bulk. The PWA writes to OPFS once via `FileSystemSyncAccessHandle.write()` in a worker and reads tile-by-tile during inference, so the wasm peak stays in the tens of MiB regardless of model size.
- ✅ Multi-turn chat with system prompt, mid-generation Stop, persistent KV cache.
- ✅ Encoder chained + per-layer submits (M7 + M15) — one CommandEncoder spans each transformer layer, submitted incrementally so the GPU drains smoothly even on tight-RAM phones.
- 🧪 Local LoRA fine-tuning (`rullama-finetune`, native-only) — backward kernels for matmul Q4_K / Q6_K, rmsnorm, rope, geglu, attention, cross-entropy; Adam optimizer over GPU buffers; rank-r LoRA on attention and FFN projections. M3 acceptance: 200-step overfit-one drops loss from ~12.5 → 0 on the dev fixture. Per-position CE (single-forward variant) shipped in M3. Wasm32 trainer is not in scope.
- ❌ MoE `gemma4:26b`/`gemma4:31b` — out of scope.
- ❌ Other architectures (llama, mistral, qwen, phi).
- 🛠️ Mobile multimodal — desktop multimodal works; the iPhone loader currently skips the vision/audio towers to fit in shared RAM. Lazy upload for those is a follow-up.
You need:
- Rust ≥ 1.91 + `wasm-pack` (`cargo install wasm-pack --locked --version 0.13.1`)
- A WebGPU-capable browser (Chrome 113+, Edge 113+, recent Firefox; iOS Safari 17.4+ for phones)
- Ollama installed locally with `gemma4:e2b` pulled (`ollama pull gemma4:e2b`)
# wasm-pack runs against the rullama crate inside the workspace; the
# output lands at `<repo>/pkg/` so both example PWAs share one bundle.
wasm-pack build crates/rullama --target web --release --out-dir ../../pkg

This emits `pkg/rullama.js` + `pkg/rullama_bg.wasm` + TypeScript typings.
The repo ships two browser harnesses against the same wasm bundle:
| Path | Stack | Purpose |
|---|---|---|
| examples/web/ | React + Vite + Tailwind + Workbox SW | Production-quality chat PWA — service-worker-based offline shell, restart dialog on deploys, attachment UI, conversation history in OPFS + SQLite (rsqlite-wasm). |
| examples/pwa/ | Vanilla JS + bash | Bench harness and safaridriver-driven scripted iPhone runs. |
Pick examples/web/ for hacking on the user-facing chat experience; pick
examples/pwa/ for kernel benchmarks or hands-off iPhone perf reruns.
# React / Vite PWA — auto-runs the wasm bundle build via `pnpm dev`.
cd examples/web
pnpm install
pnpm dev # https://localhost:5173/
# Vanilla bench / iPhone harness — needs the wasm bundle built first.
CERT_FILE=~/.local/share/rullama/cert.pem KEY_FILE=~/.local/share/rullama/key.pem \
./examples/pwa/serve.sh
# Desktop: https://localhost:8088/examples/pwa/index.html
# iPhone:  https://<mac-lan-ip>:8088/examples/pwa/index.html

The first load streams the ~7 GB blob from the local Ollama install (or an R2
mirror — see scripts/upload-models-to-r2.sh) through a Dedicated Worker that
owns a FileSystemSyncAccessHandle over OPFS. Bytes go network → sync handle
→ disk without ever pinning a Blob in the JS heap. Subsequent loads (within
the same Safari session) reuse the cached file.
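For orientation, here is a minimal sketch of that network → sync handle → disk path using the standard OPFS APIs. It assumes a Dedicated Worker context (sync access handles are Worker-only); the function name and error handling are illustrative, not the repo's actual opfs-writer-worker.js code.

```ts
// Sketch only: stream a GGUF into OPFS without buffering it in the JS heap.
// Assumes this runs inside a Dedicated Worker.
async function downloadToOpfs(url: string, fileName: string): Promise<void> {
  const root = await navigator.storage.getDirectory();
  const file = await root.getFileHandle(fileName, { create: true });
  const handle = await file.createSyncAccessHandle(); // Worker-only API
  const reader = (await fetch(url)).body!.getReader();
  let offset = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    handle.write(value, { at: offset }); // bytes go straight to disk
    offset += value.byteLength;
  }
  handle.flush();
  handle.close();
}
```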
The vanilla PWA is fully drivable from the Mac via Apple's safaridriver:
# One-time setup on the phone:
# Settings → Safari → Advanced → Remote Automation = on
# Web Inspector = on
# Feature Flags → WebGPU = on
# Then on the Mac:
safaridriver -p 4444 &
./examples/pwa/iphone-session-keeper.sh & # keep an OPFS scope alive
./examples/pwa/run-on-iphone.sh # navigate → Load → chat → log perf
./examples/pwa/clean-iphone.sh           # wipe OPFS between trials

/tmp/rullama-page.log collects beacon traces from the page ([chat],
[pe], [tg], [gen], [wkr], [rs]) so any regression in a phone
run leaves a server-side trail even after a WebContent crash.
The same code paths run natively against host wgpu (Metal on macOS, Vulkan on Linux). Useful for parity testing without a browser:
# Greedy parity vs Ollama (CPU oracle)
cargo run -p rullama --release --features cpu-reference --example greedy_parity -- \
~/.ollama/models/blobs/sha256-<digest> "Hi" 5
# Full-stack chat through the public Model API
cargo run -p rullama --release --features cpu-reference --example model_api -- \
~/.ollama/models/blobs/sha256-<digest> "Hi" --greedy --max=16
# Standalone chained forward (M7 perf path)
cargo run -p rullama --release --features cpu-reference --example chained_smoke -- \
~/.ollama/models/blobs/sha256-<digest> "Hi" --max=8

`--features cpu-reference` is now a no-op (the f32 oracle is always built); the
flag is kept so existing scripts keep working.
rullama-finetune runs LoRA SGD against the live wgpu kernels — no Burn, no
PyTorch, no separate runtime. M3 scope: rank-r LoRAs on
attn_q / attn_k / attn_v / attn_o and the FFN projections, Adam, global
L2 grad clipping, gradient accumulation, mixed precision, gradient
checkpointing. Per-position CE is a single-forward variant with a ~C/2 speedup
vs. the multi-forward path. Wasm32 builds compile to an empty crate.
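For readers new to LoRA, the rank-r update being trained here takes the standard form (a generic sketch; the exact scaling factor used by rullama-finetune is not pinned down by this README):

$$
W' = W + \tfrac{\alpha}{r}\, B A, \qquad A \in \mathbb{R}^{r \times d_{\text{in}}}, \quad B \in \mathbb{R}^{d_{\text{out}} \times r}
$$

so only the small A and B matrices (plus their Adam moments) occupy trainable GPU buffers, while the quantized base weight W stays frozen.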
# Overfit a single (prompt, target) pair — acceptance test that the
# backward path and Adam are wired correctly.
cargo run -p rullama-finetune --release --example overfit_one -- \
~/.ollama/models/blobs/sha256-<digest>
# Train on a JSONL dataset. See `crates/rullama-finetune/examples/data/echo.jsonl`
# for the format; env knobs documented in the example's docstring.
cargo run -p rullama-finetune --release --example train_jsonl -- \
~/.ollama/models/blobs/sha256-<digest> \
crates/rullama-finetune/examples/data/echo.jsonl

Currently the trainer mutates LoRAs in place against the loaded model — there is no on-disk checkpoint format yet, no adapter export, no inference hand-off back to the wasm runtime. That is the next milestone.
PWA (host page) ──┐
▼ postMessage RPC
┌──────────────────────────────────────────────────────────────────┐
│ inference-worker.js (Dedicated Worker) │
│ ▶ owns FileSystemSyncAccessHandle for the GGUF │
│ ▶ owns the wasm Model handle │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ wasm32 (Rust, the rullama crate) │ │
│ │ Model.loadFromOpfs(read_fn, total) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ GgufReader (header only, ~5 MB) │ │
│ │ │ │ │
│ │ │ TensorFetcher (OPFS sync read | HTTP Range)│
│ │ ▼ │ │
│ │ WeightCache ─────────▶ Forward / VisionForward / │ │
│ │ (lazy GPU upload, GpuAudioForward │ │
│ │ per-tile range fetch (per-layer encoder │ │
│ │ on big tensors) submits, GPU-resident │ │
│ │ KV cache) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ wgpu (WebGPU / Metal / Vulkan) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ WGSL kernels: matmul Q4_K/Q6_K/F16, rmsnorm, │ │
│ │ rmsnorm_per_row, rope_neox, attention (incl. │ │
│ │ HPD-f16 + block-local + subgroup variants), │ │
│ │ conv2d, geglu, softcap, residual_add, scale, │ │
│ │ top_k, quick_gelu, plus backward kernels for │ │
│ │ training (cross_entropy, rmsnorm, rope, geglu, │ │
│ │ attention dQ / dKV, matmul Q4_K / Q6_K, Adam) │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
│
▲ postMessage replies (tokens, errors)
PWA renders tokens, manages chat history, handles attachments.
The Worker move (M15) is what unblocked iPhone inference: iOS Safari only
exposes FileSystemSyncAccessHandle in Worker contexts, and the Worker
isolates inference from main-thread page-watchdog reapers.
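The worker side of that RPC boils down to a few message handlers around the wasm Model handle. A hedged sketch follows: the message shapes, the read_fn signature, and the generate call are illustrative stand-ins, not the real examples/pwa/inference-worker.js protocol; only Model.loadFromOpfs(read_fn, total) comes from the diagram above. It assumes a module worker so ES imports work.

```ts
// inference-worker sketch (illustrative). Owns the sync OPFS handle and the
// wasm Model; streams tokens back to the page via postMessage.
import init, { Model } from "../../pkg/rullama.js";

let model: Model | undefined;

self.onmessage = async (e: MessageEvent) => {
  const msg = e.data;
  if (msg.type === "load") {
    await init(); // instantiate the wasm module
    const root = await navigator.storage.getDirectory();
    const file = await root.getFileHandle(msg.fileName);
    const handle = await file.createSyncAccessHandle();
    // Assumed read_fn shape: fill `dst` from byte `offset` of the GGUF.
    const readFn = (offset: number, dst: Uint8Array) => handle.read(dst, { at: offset });
    model = await Model.loadFromOpfs(readFn, handle.getSize());
    postMessage({ type: "loaded" });
  } else if (msg.type === "generate" && model) {
    // Hypothetical streaming call; the real generate API may differ.
    await model.generate(msg.prompt, (token: string) => postMessage({ type: "token", token }));
    postMessage({ type: "done" });
  }
};
```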
The reference Go implementation lives in Ollama's tree under
model/models/gemma4/. Every op in crates/rullama/src/reference/forward.rs
(CPU oracle), forward_chained.rs (production GPU forward),
multimodal/vision.rs, and multimodal/audio.rs corresponds 1:1.
Measurements as of M15:
| Target | Steady-state tok/s (gen) | Notes |
|---|---|---|
| iPhone 16e (A18, iOS 26) | ~4.65 tok/s | text-only, max_context=512 |
| AMD Radeon Pro 555 (Mac) | ~1 tok/s (M7 baseline) | naive kernels, tiled matmul deferred |
The architectural foundation (chained encoder, GPU-resident KV cache, per-layer submits, per-tile range fetch from OPFS) is in place. Inference kernels are still naive matvec; reaching ≥10 tok/s on both Mac and phone needs tiled matmul + bind-group caching + kernel fusion (the M8 line on the roadmap).
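The per-layer submit pattern itself is simple to state. Below is a sketch over the browser WebGPU API (the real implementation is the Rust/wgpu code in forward_chained.rs; the LayerDispatch shape here is invented for illustration):

```ts
// Sketch: one command encoder per transformer layer, submitted as soon as the
// layer is recorded, so the GPU starts draining work instead of waiting for
// one giant command buffer covering the whole forward pass.
interface LayerDispatch {
  pipeline: GPUComputePipeline;
  bindGroup: GPUBindGroup;
  workgroups: number;
}

function runForwardChained(device: GPUDevice, layers: LayerDispatch[][]): void {
  for (const layer of layers) {
    const encoder = device.createCommandEncoder();
    const pass = encoder.beginComputePass();
    for (const { pipeline, bindGroup, workgroups } of layer) {
      pass.setPipeline(pipeline);
      pass.setBindGroup(0, bindGroup);
      pass.dispatchWorkgroups(workgroups);
    }
    pass.end();
    device.queue.submit([encoder.finish()]); // incremental, per-layer submit
  }
}
```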
The iPhone A18 advertises 1 GiB for both max_buffer_size and
max_storage_buffer_binding_size — four times the WebGPU spec floor — so
there's real headroom for fewer/larger weight buffers (currently 455 of
them resident, see M15 follow-ups).
Other capability notes captured during iPhone validation:
- `shader-f16` ✓ — packed FP16 MADs engage on A18.
- `timestamp-query` ✓ — the Pro 555 doesn't expose this; could wire GPU-side per-pass timing.
- `subgroups` ✗ — A18 has SIMDgroup hardware, but Safari's WebGPU doesn't surface WGSL subgroup ops yet. Vision attention falls through to the no-subgroup HPD-f16 kernel automatically.
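These checks are easy to reproduce on any device with a small probe over the standard WebGPU API (sketch; the limit and feature names are the spec's, nothing rullama-specific):

```ts
// Probe adapter limits and optional features from any WebGPU-capable context.
async function probeWebGpu(): Promise<void> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU unavailable");

  const { maxBufferSize, maxStorageBufferBindingSize } = adapter.limits;
  console.log(`maxBufferSize = ${maxBufferSize}`);
  console.log(`maxStorageBufferBindingSize = ${maxStorageBufferBindingSize}`);

  for (const feature of ["shader-f16", "timestamp-query", "subgroups"]) {
    console.log(`${feature}: ${adapter.features.has(feature) ? "✓" : "✗"}`);
  }
}
```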
crates/rullama/
├── src/
│ ├── api.rs # JS-facing Model: load / loadFromUrl / loadFromOpfs[TextOnly]
│ ├── backend/
│ │ ├── context.rs # WgpuCtx (device, queue, adapter limits)
│ │ ├── dispatch.rs # cached + chained kernel dispatchers (incl. backward + Adam)
│ │ ├── pipelines.rs # one ComputePipeline per kernel (built once)
│ │ ├── weight_cache.rs # lazy GPU upload, per-tile range fetch on big tensors
│ │ ├── matmul.rs / elementwise.rs / spike.rs # one-shot dispatchers (parity tests)
│ ├── gguf/
│ │ ├── reader.rs # GGUF v3 parser (header + tensor descriptors)
│ │ ├── fetcher.rs # TensorFetcher trait + In-memory / HttpRange / Opfs impls
│ │ ├── tensor.rs # dequant_tensor_to_f32 / dequant_row_to_f32 (sync + async)
│ │ ├── quant.rs / dtype.rs / value.rs
│ ├── kernels/wgsl/ # 70+ hand-written compute shaders (text + vision + audio + backward)
│ ├── model/config.rs # Gemma4Config: parses gemma4.* metadata keys
│ ├── multimodal/
│ │ ├── vision.rs # ViT forward (16 blocks, 768d, ClippableLinear)
│ │ ├── audio.rs # Conformer forward (12 blocks, 1024d, block-local attention)
│ │ └── audio_features.rs # WAV → 128-bin log-mel (realfft)
│ ├── reference/
│ │ ├── forward.rs # CPU f32 forward (parity oracle)
│ │ ├── forward_gpu.rs # M3-era GPU forward with per-kernel readbacks (oracle)
│ │ ├── forward_chained.rs # M7 production GPU forward, per-layer submits (M15)
│ │ ├── ops.rs / weights.rs
│ ├── sampling.rs # temperature, top-k, top-p, rep penalty
│ ├── template/gemma4_small.rs # chat-template renderer (matches Ollama)
│ └── tokenizer/ # GGUF BPE tokenizer (Ollama-bit-exact)
└── examples/
├── greedy_parity.rs # CPU forward greedy vs Ollama
├── chained_smoke.rs # standalone Forward driver
├── model_api.rs # public Model API end-to-end
├── vision_parity.rs # vision tower vs Ollama (M11)
├── audio_parity.rs # audio tower vs Ollama (M13)
├── matmul_bench.rs # native wgpu matmul microbench
└── inspect.rs / decode_ids.rs / encode_check.rs / list_tensors.rs / …
crates/rullama-finetune/
├── src/
│ ├── shared/ # vendored config / error / progress types
│ ├── dataset_loader.rs # JSONL parser + Tokenizer trait
│ ├── lr_schedule.rs # warmup + linear / cosine / cosine-warm-restarts
│ ├── lora.rs # per-LoRA GPU state (A / B), grad buffers
│ ├── scratch.rs # per-step GPU scratch buffers for backward
│ └── session.rs # TrainingSession — forward → loss → backward → Adam
└── examples/
├── overfit_one.rs # M3 acceptance test
├── train_jsonl.rs # JSONL dataset trainer
└── data/echo.jsonl
examples/
├── web/ # React + Vite + Tailwind + Workbox SW production demo
└── pwa/ # Vanilla JS bench harness + safaridriver scripts
├── index.html / bench.html
├── inference-worker.js # Dedicated Worker — owns Model + sync OPFS handle
├── opfs-store.js # OPFS download + read API (main-thread)
├── opfs-writer-worker.js # streams GGUF → OPFS via SyncAccessHandle.write
├── serve.sh # dev HTTPS server + /api/log /api/blob endpoints
├── run-on-iphone.sh / iphone-session-keeper.sh / clean-iphone.sh
└── bench-on-iphone.sh
tools/ios-bench/ # staticlib for Xcode — C-ABI rullama_run_bench
docker/ # nginx + R2 mirror configs
scripts/ # ops scripts (model upload, etc.)
Dual-licensed under either of:
- Apache License 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
Contributions are accepted under the same dual-license terms.