Vibe-code tiny personal apps on your phone, by talking. No code ever shown.
You describe what you want ("a timer with my exact pour-over recipe", "a tracker for some thing only I care about"), an AI agent builds it against a small in-house SDK, and the result appears in your launcher as a runnable app. The thesis is the long tail of personal software: apps too small, too niche, or too personal to ever deserve a store listing. Whim collapses the effort to a sentence.
🎬 Demo video: coming soon. The sandbox runtime and version store are running on-device (screenshots below); an end-to-end recorded demo lands here once the next milestone is in.
This project exists to demonstrate harness engineering: the unglamorous machinery that makes LLM code generation reliable rather than impressive-once. The hard parts, in order of how much they fight back:
- Run untrusted, LLM-generated code on a phone, safely. Every mini-app is code nobody reviewed. It runs in a sandbox that is pen-tested, never-regress-CI-gated, and assumes the bundle is actively hostile, including assuming it will lie about its own containment.
- Govern what that code can touch. Storage and physical feedback (haptics, sound) aren't ambient capabilities; they're syscalls over an append-only registry with a fixed-order gate, reachable only from a channel a mini-app can't forge.
- Design an SDK for a model, not a human. A small, fully-documented component surface that fits in a system prompt, accepts semantic tokens instead of raw values, and makes hallucinated imports structurally impossible. Because apps only ever speak tokens, one user-chosen theme re-skins every app ever generated, retroactively.
- Version control nobody can see. Every generation is snapshotted with full history, rollback, pinning, and forking, backed by real git, on-device, with zero git vocabulary reaching the user (a build guard fails if a hash or ref leaks into the API surface).
- A self-healing generation loop (in progress): the wire contract, SSE streaming, and a stub Hono server are live; plan → generate → static-check → run → observe → repair still needs the static-check stage (proposed, not built) and a real model behind the harness. The quality of the structured diagnostics fed back, not the model, is what's meant to make this good.
The phone owns what's user-owned and must stay stable (apps, data, history, the runtime). The server owns what changes constantly (the harness, model access, checks, telemetry). The server is stateless; the device is the system of record.
flowchart LR
subgraph Phone["📱 Phone: system of record"]
UI["Host app<br/>(launcher · prompt UI)"]
RT["Sandbox runtime<br/>(hardened WebView)"]
VS[("Version store<br/>(on-device git,<br/>product verbs only)")]
end
subgraph Server["☁️ Server: stateless (skeleton live)"]
H["Generation harness<br/>SSE stub pipeline today;<br/>plan → generate → check<br/>→ run → observe → repair (target)"]
end
M["LLM endpoint"]
UI -- "describe an app" --> H
H <--> M
H -- "verified bundle (IIFE)" --> RT
RT -- "snapshot every generation" --> VS
A mini-app is one TypeScript file that imports only from `vc-sdk` and exports a `defineApp({...})` spec. esbuild turns it into a ~4.5 KiB IIFE; `react`, `react-dom`, and `vc-sdk` stay external and resolve to host-injected globals, so the resolvable module surface at runtime is exactly those three names; everything else throws.
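As a rough illustration of a module surface limited to three names, the sketch below builds a resolver over host-injected objects. `makeResolver`, the placeholder module objects, and the error text are hypothetical names chosen for this example; the project's actual loader is not shown in this README and may work differently.

```typescript
// Hypothetical sketch: a resolver that exposes only the host-injected modules.
// `makeResolver` and the placeholder module objects are illustrative names,
// not the project's actual loader API.
type ModuleName = "react" | "react-dom" | "vc-sdk";

const ALLOWED: readonly ModuleName[] = ["react", "react-dom", "vc-sdk"];

function makeResolver(injected: Record<ModuleName, unknown>) {
  // Copy into a null-prototype object so lookups such as "constructor" or
  // "__proto__" cannot fall through to Object.prototype.
  const table: Record<string, unknown> = Object.create(null);
  for (const name of ALLOWED) table[name] = injected[name];
  Object.freeze(table);

  return function resolveModule(name: string): unknown {
    if (!(name in table)) {
      throw new Error(`module not resolvable in sandbox: ${name}`);
    }
    return table[name];
  };
}

// Placeholder objects stand in for the real host-injected globals.
const resolveModule = makeResolver({
  react: { id: "react-global" },
  "react-dom": { id: "react-dom-global" },
  "vc-sdk": { id: "vc-sdk-global" },
});
```

With this shape, a hallucinated import such as `resolveModule("fs")` fails loudly at load time instead of resolving to something unexpected.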
Containment rests on three legs, and the pen-testing showed none is sufficient alone:
flowchart TB
subgraph RN["React Native host β trusted"]
subgraph WV["WebView outer document β trusted"]
subgraph IF["cross-origin iframe: sandbox=allow-scripts, opaque origin"]
direction TB
NEU["neutralize.js: window-level value-strip<br/>fetch · XHR · WebSocket · RTCPeerConnection<br/>localStorage · indexedDB · Worker · sendBeacon"]
LD["trusted loader: holds nothing stronger<br/>than parent.postMessage"]
SDK["vc-sdk + one shared React instance"]
APP["untrusted mini-app bundle"]
end
end
end
CSP["CSP: script-src without 'unsafe-eval'<br/>default-src 'none' · connect-src 'none'"] -.enforced on.-> IF
APP -- "render via vc-sdk only" --> SDK
IF -- "nonce-authenticated frames" --> WV
- The cross-origin iframe (no `allow-same-origin`) denies all host/native reach: `parent.document`, `top.location`, and the RN bridge all fail with `SecurityError`.
- The CSP without `'unsafe-eval'` is the only thing that closes the `({}).constructor.constructor('…')` codegen hole: every object reaches the `Function` constructor through its prototype chain, so no amount of global-stripping can. Conversely, `eval`/`Function` are not value-replaced: React's internals need them, and CSP kills codegen at the engine level anyway. Strip the capability, not the identifier.
- The global value-strip covers what CSP can't, notably `RTCPeerConnection`, since WebRTC ignores `connect-src`.
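The prototype-chain point in the second leg can be shown in a few lines. This is a standalone illustration, not code from the project: it shadows the `Function` identifier to mimic a value-strip, then recovers the constructor from an object literal. It runs outside any CSP, which is exactly why the call succeeds here.

```typescript
// Illustration only: shadowing the `Function` identifier does not remove the
// capability, because any object literal still reaches the constructor.
function codegenDespiteStrippedIdentifier(): number {
  // Simulate a value-strip of the identifier inside this scope.
  const Function: undefined = undefined;
  void Function;

  // ({}).constructor is Object; Object.constructor is the real Function.
  const recovered = ({}).constructor.constructor as unknown as (
    body: string,
  ) => () => number;

  // Without a CSP this compiles and runs new code. Under a policy that omits
  // 'unsafe-eval', the engine refuses this call instead.
  return recovered("return 6 * 7")();
}

const codegenResult = codegenDespiteStrippedIdentifier(); // 42 when no CSP applies
```

The identifier is gone from scope, yet the code generator is still reachable, which is why the text treats the CSP rather than the strip as the control for this hole.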
The adversarial suite assumes the worst finding from pen-testing (F4): a bundle shares the iframe scope with the loader and can forge its own "I'm contained" verdict. So the host never trusts a self-report: the verdict comes from closure-captured probes the bundle can't overwrite, and every iframe→host control frame is authenticated with a per-realm nonce. Realms are recreated per generation, because an earlier generation can otherwise backdoor the next one through `Object.prototype` (confirmed on-device, now a regression test).
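A host-side check for nonce-authenticated control frames might look roughly like the sketch below. The frame shape (`kind`, `nonce`, `generation`), the helper names, and the sample nonce string are all assumptions made for illustration; the real wire format is not documented in this README. Nonce generation is also out of scope here and is assumed to come from a cryptographically secure source, fresh for each realm.

```typescript
// Hypothetical frame shape and helper names; the real wire format lives in the
// project's runtime and is not shown in this README.
interface ControlFrame {
  kind: string;
  nonce: string;
  generation: number;
  payload?: unknown;
}

// Compares every character rather than exiting at the first mismatch.
// (A production host would likely use a vetted constant-time primitive.)
function sameNonce(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  return diff === 0;
}

// One verifier per realm. The expected nonce is captured in the closure and is
// never stored on a shared global.
function createRealmVerifier(nonce: string, generation: number) {
  return function acceptFrame(frame: unknown): frame is ControlFrame {
    if (typeof frame !== "object" || frame === null) return false;
    const f = frame as Partial<ControlFrame>;
    return (
      typeof f.kind === "string" &&
      typeof f.nonce === "string" &&
      f.generation === generation &&
      sameNonce(f.nonce, nonce)
    );
  };
}

// Placeholder nonce for the example; a real one would be random per realm.
const acceptFrame = createRealmVerifier("example-nonce-gen-7", 7);
```

Because the verifier is rebuilt with each realm, a frame carrying an earlier generation's nonce or generation number is rejected the same way a forged one is.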
All of this is enforced by a never-regress invariant suite (`npm run invariants`) that runs as a blocking CI gate, including a deliberately-broken-CSP negative control, so the suite proves it isn't vacuously green.
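The negative-control idea can be reduced to a small predicate: a check only counts if it passes on the real build and fails on a deliberately broken one. The `Build` and `Invariant` types and the two policy strings below are placeholders invented for this sketch, not the project's actual CSP or suite API.

```typescript
// Hypothetical harness types; the policy strings are placeholders, not the
// project's real CSP.
interface Build {
  csp: string;
}
type Invariant = (build: Build) => boolean;

// The invariant under test: the policy must not allow 'unsafe-eval'.
const noUnsafeEval: Invariant = (build) => !build.csp.includes("'unsafe-eval'");

// A check is non-vacuous only if it passes on the good build AND fails on a
// deliberately broken one.
function isNonVacuous(check: Invariant, good: Build, broken: Build): boolean {
  return check(good) && !check(broken);
}

const goodBuild: Build = {
  csp: "default-src 'none'; script-src 'self'; connect-src 'none'",
};
const brokenBuild: Build = {
  csp: "default-src 'none'; script-src 'self' 'unsafe-eval'",
};

// A check that can never fail is caught by the same predicate.
const alwaysGreen: Invariant = () => true;
```

Here `isNonVacuous(noUnsafeEval, goodBuild, brokenBuild)` holds, while `isNonVacuous(alwaysGreen, goodBuild, brokenBuild)` does not, which is the property the negative control is meant to surface.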
Every generation is committed to a real git repository on the device (isomorphic-git under Hermes), one repo per mini-app. The public API speaks product verbs only (snapshot · history · diff · rollback · pin · fork), and a build-time guard fails if git vocabulary (a hash, a ref, a commit key) ever reaches a return shape. Since isomorphic-git has no gc, compaction is a DIY pack-then-drop-loose pass, triggered by loose-object count (the real pressure point on a KV-backed FS, not bytes).
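One way to picture the vocabulary guard is a recursive scan over a sample return value. The banned-key list, the 40-hex pattern, and the `findGitLeaks` name are assumptions for this sketch; the project's real guard and its rule set are not shown here.

```typescript
// Hypothetical guard: the banned key list and hash pattern are assumptions,
// not the project's actual build-time rule set.
const BANNED_KEYS = new Set(["hash", "sha", "oid", "ref", "commit", "branch", "head"]);
const LOOKS_LIKE_OBJECT_ID = /^[0-9a-f]{40}$/;

// Walks a return shape and reports every place where git vocabulary appears,
// either as a key name or as a value that looks like an object id.
function findGitLeaks(value: unknown, path = "$"): string[] {
  const leaks: string[] = [];
  if (typeof value === "string") {
    if (LOOKS_LIKE_OBJECT_ID.test(value)) leaks.push(`${path}: value looks like an object id`);
  } else if (Array.isArray(value)) {
    value.forEach((item, i) => leaks.push(...findGitLeaks(item, `${path}[${i}]`)));
  } else if (typeof value === "object" && value !== null) {
    for (const [key, child] of Object.entries(value)) {
      if (BANNED_KEYS.has(key.toLowerCase())) leaks.push(`${path}.${key}: git vocabulary in key`);
      leaks.push(...findGitLeaks(child, `${path}.${key}`));
    }
  }
  return leaks;
}

// Example shapes (illustrative field names).
const cleanSnapshot = { label: "v3", createdAt: 1700000000000, pinned: false };
const leakySnapshot = { label: "v3", commit: "9fceb02d0ae598e95dc970b74767f19372d61af8" };
```

A guard like this would run at build time over representative return values of each product verb and fail the build on any non-empty result.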
Mini-apps don't get ambient access to storage, haptics, or sound; they reach host capabilities only through a governed syscall layer (an append-only registry, a fixed-order gate, a generation-fenced dispatcher) between the sandboxed iframe and the RN host. The syscall channel needs no nonce: a forged `sysret` posted by a bundle to its own window arrives with `ev.source` pointing at the iframe's own window, never `window.parent`. The browser sets `source`, so it's unforgeable by construction. Storage (schema-declared, per-app SQLite) was syscall #1; physical cues (haptics, sound) are #2 and #3, gated by manifest-declared capability tokens. An undeclared capability is denied with a structured error, never silently dropped.
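The governed path described above could be sketched as follows. Everything named here is hypothetical: the `Registry` class, the `gate` function, the error codes, the verb and capability strings, and in particular the order of the three checks, which the README calls "fixed-order" without spelling out. Plain objects stand in for browser windows so the example runs outside a WebView.

```typescript
// Hypothetical names throughout; a sketch of the governed path, not the
// project's actual bridge.
type Handler = (args: unknown) => unknown;
interface SyscallEntry { verb: string; capability: string; handler: Handler }

class Registry {
  private readonly entries: SyscallEntry[] = [];
  // Append-only: verbs can be added, never replaced or removed.
  register(entry: SyscallEntry): number {
    if (this.entries.some((e) => e.verb === entry.verb)) {
      throw new Error(`verb already registered: ${entry.verb}`);
    }
    return this.entries.push(Object.freeze({ ...entry })) - 1;
  }
  lookup(verb: string): SyscallEntry | undefined {
    return this.entries.find((e) => e.verb === verb);
  }
}

interface SyscallRequest { verb: string; generation: number; args: unknown }
type DenialCode = "UNKNOWN_VERB" | "CAPABILITY_DENIED" | "STALE_GENERATION";
type SyscallResult =
  | { ok: true; value: unknown }
  | { ok: false; error: { code: DenialCode; verb: string } };

// Assumed order: registry lookup, then declared capability, then the
// generation fence. Only a call that clears all three reaches the handler.
function gate(
  reg: Registry,
  declared: ReadonlySet<string>,
  liveGeneration: number,
  call: SyscallRequest,
): SyscallResult {
  const deny = (code: DenialCode): SyscallResult => ({ ok: false, error: { code, verb: call.verb } });
  const entry = reg.lookup(call.verb);
  if (!entry) return deny("UNKNOWN_VERB");
  if (!declared.has(entry.capability)) return deny("CAPABILITY_DENIED");
  if (call.generation !== liveGeneration) return deny("STALE_GENERATION");
  return { ok: true, value: entry.handler(call.args) };
}

// Reply-side check from the text: trust a sysret only if the browser-set
// `source` is the parent window.
function isTrustedSysret(ev: { source: unknown }, parentWindow: object): boolean {
  return ev.source === parentWindow;
}

const registry = new Registry();
registry.register({ verb: "kv.set", capability: "storage", handler: (args) => args });
registry.register({ verb: "haptics.tap", capability: "haptics", handler: () => "buzz" });
const declaredCaps = new Set(["storage"]); // this manifest declares storage only
```

Returning a tagged result instead of throwing keeps the "denied with a structured error, never silently dropped" property visible in the type: a caller cannot read `value` without first handling the denial branch.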
Mini-apps never pick colors. Components accept semantic tokens (`color="primary"`, `radius="md"`), and the token resolvers read the user's theme: six curated presets (light and dark) plus accent and corner-shape knobs, chosen in the launcher's settings and persisted on-device. The resolved theme crosses into the sandbox as inert JSON on the existing init frame (no new message kind, no CSP or resolver change) and is sanitized at the iframe boundary like any untrusted input (a hostile bundle mutating the theme global only mis-themes itself). The payoff of tokens-not-values: every app ever generated re-skins instantly, including snapshots made before theming existed. The component kit (forms, toggles, sliders, lists, cards, modal, progress; ~35 exports, deliberately under the system-prompt ceiling) is documented for the model in `docs/sdk-reference.md` and exercised end-to-end by a seeded Style Gallery app.
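To make tokens-not-values concrete, here is a minimal sketch of a sanitizer and a resolver. The token names beyond `primary` and `md`, the theme shape, the fallback values, and both function names are assumptions for illustration; the actual contract is the one in `docs/sdk-reference.md`, which this README does not reproduce.

```typescript
// Hypothetical token names and theme shape, for illustration only.
type ColorToken = "primary" | "surface" | "text";
type RadiusToken = "sm" | "md" | "lg";
interface Theme {
  colors: Record<ColorToken, string>;
  radii: Record<RadiusToken, number>;
}

const FALLBACK_THEME: Theme = {
  colors: { primary: "#3366ff", surface: "#ffffff", text: "#111111" },
  radii: { sm: 4, md: 8, lg: 16 },
};

const HEX_COLOR = /^#[0-9a-fA-F]{6}$/;

// Treat the incoming theme as untrusted JSON: copy only known keys, keep only
// well-formed values, and fall back field by field.
function sanitizeTheme(input: unknown): Theme {
  const src = (typeof input === "object" && input !== null ? input : {}) as {
    colors?: Record<string, unknown>;
    radii?: Record<string, unknown>;
  };
  const theme: Theme = {
    colors: { ...FALLBACK_THEME.colors },
    radii: { ...FALLBACK_THEME.radii },
  };
  for (const key of Object.keys(FALLBACK_THEME.colors) as ColorToken[]) {
    const v = src.colors?.[key];
    if (typeof v === "string" && HEX_COLOR.test(v)) theme.colors[key] = v;
  }
  for (const key of Object.keys(FALLBACK_THEME.radii) as RadiusToken[]) {
    const v = src.radii?.[key];
    if (typeof v === "number" && Number.isFinite(v) && v >= 0) theme.radii[key] = v;
  }
  return theme;
}

// Components resolve tokens at render time, so swapping the theme re-skins
// an app without touching its source.
const resolveColor = (theme: Theme, token: ColorToken): string => theme.colors[token];
const resolveRadius = (theme: Theme, token: RadiusToken): number => theme.radii[token];
```

Since an app only ever stores the token (`"primary"`), the same bundle rendered against two sanitized themes yields two different colors with no regeneration.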
Everything below was measured on the real target (Android System WebView / Hermes, RN new architecture, offline release build), not desktop Chrome.
| What | Result |
|---|---|
| Containment probes (trusted vantage) | 42/42 pass, contained:true |
| Mini-app mount → first paint | ~119 ms cold · ~32 ms warm realm |
| Mini-app bundle size | ~4.5 KiB IIFE |
| Snapshot / rollback / fork | ~45–86 ms · ~58–183 ms · ~37–68 ms |
| Storage cost per generation | ~650 B + ~4 git objects |
| Compaction | 48 loose objects → 0; history/rollback/fork still resolve |
| Persistence across app kills | 3× kill+relaunch cycles, 0 corruption |
| Storage-engine writes (on-device) | update/remove ~1–11 ms · kv.set ~1–9 ms · single append ~1.2 ms warm |
| Capability-bridge round-trip (on-device) | ~16–17 ms median per syscall, every verb; transport-bound (2 WebView↔RN crossings), not engine-bound |
Left: the containment verdict rendered on-device. Right: the version store verifying its own snapshots across three app restarts.
The process is as much the portfolio piece as the code:
- Spike-driven de-risking. Every risky unknown (can a WebView contain a hostile bundle? does isomorphic-git run under Hermes? what's the bundle delivery channel?) got a throwaway spike with explicit hypotheses and an on-device verdict. Spike scaffolds are deleted; findings outlive them in `docs/`.
- A numbered decision log. `docs/decisions.md` records 43+ decisions with their rejected alternatives, including the reversals, kept on the record.
- Adversarial verification. The bundle contract was pen-tested (T1–T8 + F4) before being productionized; the attacks that landed became carry-forward constraints, and the constraints became CI.
- Spec-driven changes. Work flows through OpenSpec proposals → design → tasks → archive, with capability specs as the source of truth.
- A raw devlog. `DEVLOG.md` captures the dead ends and "I was wrong about X" lessons before they evaporate.
- An agentic build harness with adversarial self-checks. Most implementation work is dispatched to subagents over isolated git worktrees, gated by a pinned-commit integrity check (never a HEAD diff: once an agent can commit, that check is foldable) and a red/green check proving each test is non-vacuous before merge. Built first as a parallel batch-fix loop, generalizing next to the full OpenSpec build loop.

| Status | Milestone |
|---|---|
| ✅ | Sandbox runtime (v0.1): hardened WebView realm, bundle contract, nonce-authenticated verdicts, blocking-CI invariant suite |
| ✅ | On-device version store (v0.2): snapshot/history/rollback/pin/fork over isomorphic-git + MMKV, accepted on-device |
| ✅ | Per-app storage engine (v0.2): schema-declared SQLite, burned-ID columns, additive-only evolution, accepted on-device |
| ✅ | Capability bridge: governed syscalls (storage, haptics/sound) over an append-only registry, accepted on-device |
| ✅ | Effects & cues (v0.3): web-resident timers + native haptic/sound feedback, accepted on-device |
| ✅ | Launcher shell: home grid, full-screen launch, system-back exit, fork/delete, first-run seeding |
| ✅ | SDK design system: themeable token contract (6 presets, accent/shape knobs, dark mode), ~35-export component kit, theme delivered into the sandbox as inert data; verified on-emulator (release build) |
| 🚧 | Generation harness: skeleton live (Hono server, SSE wire contract, durable token metering); the real plan→generate→check→run→repair loop and a model behind it aren't wired yet |
| 🚧 | Static check pipeline: proposed, not yet built; closes the one open pen-test finding (token-scan checks miss prototype-pollution) |
| ⏳ | Voice input, iOS |
flowchart LR
P["Plan"] --> G["Generate"] --> S["Static check"] --> R["Run in sandbox"] --> O["Observe<br/>structured diagnostics"]
O -- "repair (≤3 attempts)" --> G
O -- "clean" --> D["Deliver + snapshot"]
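The target loop in the flowchart can be sketched as a bounded retry over injected stages. This is a sketch of the intended shape only: the README states the loop is not wired yet and the static-check stage is proposed, not built. The `Stages` interface, the `Diagnostic` shape, and the reading of the repair bound as one initial attempt plus up to three repairs are all assumptions.

```typescript
// Hypothetical stage signatures for the target loop, not a built pipeline.
interface Diagnostic {
  stage: "static-check" | "run";
  message: string;
}
interface Stages {
  plan(prompt: string): Promise<string>;
  generate(plan: string, feedback: Diagnostic[]): Promise<string>;
  staticCheck(source: string): Promise<Diagnostic[]>;
  runInSandbox(source: string): Promise<Diagnostic[]>;
}
type Outcome =
  | { status: "delivered"; source: string; repairs: number }
  | { status: "failed"; diagnostics: Diagnostic[] };

const MAX_REPAIRS = 3;

async function generateApp(prompt: string, s: Stages): Promise<Outcome> {
  const plan = await s.plan(prompt);
  let feedback: Diagnostic[] = [];
  // One initial attempt plus up to MAX_REPAIRS repair attempts.
  for (let repairs = 0; repairs <= MAX_REPAIRS; repairs++) {
    // Structured diagnostics from the previous attempt feed the next one.
    const source = await s.generate(plan, feedback);
    const staticIssues = await s.staticCheck(source);
    // Skip the sandbox run when the static check already found problems.
    feedback = staticIssues.length > 0 ? staticIssues : await s.runInSandbox(source);
    if (feedback.length === 0) return { status: "delivered", source, repairs };
  }
  return { status: "failed", diagnostics: feedback };
}
```

Passing the stages in as an interface keeps the loop testable with stubs, which fits the stated position that the diagnostics fed back matter more than the model behind `generate`.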
build/ esbuild pipeline → mini-app bundles + the runtime HTML the WebView loads
contract/ @whim/contract: zod wire schemas shared by device and server
server/ @whim/server: Hono harness server skeleton (SSE generation, token metering)
src/runtime/ the WebView sandbox runtime (neutralize · resolver · probes · loader · syscall)
src/sdk/ vc-sdk: the private SDK mini-apps are written against
src/host/ RN shell: launcher, capability bridge, storage engine, version store
invariants/ never-regress containment suites (blocking CI gate)
fixtures/ sample mini-apps (incl. the Style Gallery showcase) + adversarial bundles that attack the sandbox
docs/ spec · numbered decision log · spike findings · build-harness design · prompt-ready SDK reference
openspec/ spec-driven change workflow (proposals → specs → archive)
npm install
npm run build # esbuild → runtime HTML + app bundles + artifacts
npm run invariants # the containment suite vs this exact build (headless Chromium)
npm run vstore:test # version-store acceptance suite (Node)
npm run storage:test # storage-engine acceptance suite (Node)
npm run bridge:test # capability-bridge acceptance suite (Node)
npm run launcher:test # launcher + theme acceptance suite (Node)

Desktop Chromium is the fast pre-check; the authoritative verdict is the real Android WebView. To run on a device/emulator (Node 22, JDK 21): `npm run android:release`.

