Just the agent, the weights, and the silicon. Serving WASM-sandboxed agents via CLI, HTTP API, TUI, or Web Debugger.
nanos is not a VM, and it is not a container. It is a kernel-level LLM and agent sandboxer.
By isolating only the agent execution logic in a WASM sandbox while keeping heavy resource state (model weights, GPU memory allocations) native to the host, nanos pairs sandbox isolation with native hardware execution speed.
Instead of virtualizing an operating system or containerizing a network stack, nanos acts as a microkernel for AI agents. The native host runtime acts as the kernel space—providing secure, audited access to files, networks, MCP tools, and GPU silicon (Apple Metal/CUDA)—while the agent executes inside a capability-isolated user-space WebAssembly (WASM) sandbox.
Rather than deploying agents as bloated virtual machines that talk to tools over HTTP, nanos executes tool calls via direct, in-memory Foreign Function Interface (FFI) pointer passing. The host and guest communicate through the WASM linear memory region exposed by the runtime, eliminating JSON serialization latency and local TCP socket overhead.
nanos integrates three capabilities that are rarely combined in open-source local agent runtimes:
| Property | What existed before | What nanos does |
|---|---|---|
| WASM process isolation | WasmEdge/LlamaEdge — but LLM runs as HTTP server | Agent logic isolated in WASM sandbox |
| Native Metal/CUDA GPU offload | llama.cpp bare-metal — but zero isolation | LLM inference runs natively on host GPU |
| In-process FFI tool syscalls | MCP over HTTP/stdio — network round-trip per call | Tools called via zero-copy memory pointer pass |
nanos integrates these capabilities into a single local runtime architecture.
Many current agent stacks trade isolation, latency, and memory efficiency for ease of integration. A typical stack looks like this:
Docker (200MB) → Python (2s boot) → pip install langchain (500MB) → MCP server (HTTP daemon) → LLM API (TCP socket, JSON serialize, wait, parse)
Every arrow represents latency, memory consumption, and a larger attack surface.
nanos throws out the entire stack:
nanos run agent.nano → WASM sandbox boots (< 50ms) → weights resident in GPU memory → tool calls via FFI pointer pass (zero copy) → done.
One runtime. No localhost tool daemons. No JSON RPC loops. No serialization tax.
- Capability-Isolated WASM Sandbox: Every agent runs inside a strict, metered `wasmtime` store with WASM linear memory isolation, fuel limits to prevent infinite loops, and strict memory caps.
- Native Metal & CUDA GPU Offload: Model weights are loaded directly into Apple Metal or Linux CUDA graphics memory via native `llama.cpp` layers (`--features gpu-cuda`).
- Multi-Agent Fleet Orchestration: Orchestrate cooperative multi-agent fleets concurrently sharing a single `LlmEngine` locally via threads or across networks using distributed TCP message bus client/server connections.
- Universal MCP Tool Proxy: Bridge standard Model Context Protocol (MCP) servers straight to WASM. Query tools, discover resources, pull prompts, and validate schemas dynamically.
- Time-Travel Visual Web Debugger: Inspect step execution traces, RAM consumption, tokens, and FFI latency. Click to edit observations or prompt variables, and launch divergent replays.
- Sandboxed JS/TS SDK Runtime: Write agents in TypeScript/JavaScript, compile them into WASM dynamic bundles via `nanos-compile.js`, and execute them safely with dynamic host permission rules.
nanos achieves its unique combination of sandbox isolation and native GPU speed by using a Microkernel-inspired architecture. Instead of virtualizing the host hardware (like a VM or container), nanos virtualizes only the agent's application code space using WebAssembly (WASM).
The runtime is split into two strictly separated execution spaces:
- User Space (Guest Sandbox): This is where the agent logic runs. Guest code is compiled to WebAssembly (JS/TS agents compile along with an optimized QuickJS virtual machine into a single `.wasm` binary). The sandbox has zero native access to files, network, or hardware.
- Kernel Space (Host Runtime): The native Rust engine. It compiles natively for your specific processor architecture (Apple Silicon ARM64, Linux x86_64, etc.) and has direct access to Apple Metal APIs, Linux CUDA drivers, and local filesystem/network resources.
In traditional agent stacks, tool execution requires local TCP loops, loopback routing, and HTTP JSON serialization. nanos treats tool calls like Operating System syscalls:
- Shared Memory: The host allocates a linear segment of RAM for the WASM guest sandbox. The host runtime can directly access the guest's WASM linear memory buffer through controlled runtime APIs, enabling direct linear-memory argument passing without localhost socket serialization.
- Pointer Passing: When the agent calls a tool like `fs.readFile("data.txt")`, the WASM guest writes the path into its linear memory and executes an FFI syscall (`nanos_fs_read(ptr, len)`).
- Instant Execution: The host intercepts the syscall, reads the arguments directly from the sandbox memory offset, validates the manifest permissions, executes the tool natively, writes the result back to WASM memory, and resumes guest execution.
- Zero-Copy Speed: This whole process completes in microseconds (< 1ms) because no network sockets are opened and no JSON serialization occurs.
Instead of compiling the matrix math of heavy LLM runtimes into WASM (which adds compiler layers and degrades performance), nanos keeps the inference engine native to the host:
- When the agent writes `llm.infer("...")`, the WASM guest triggers an FFI syscall: `llm_infer(prompt_ptr, prompt_len)`.
- The Rust host reads the prompt from the shared WASM memory segment.
- The host passes the prompt to its native `LlmEngine` (linked directly to the host's Apple Metal or CUDA drivers).
- The GPU executes the generation natively (154 tokens/sec on Metal for Qwen 0.5B) and streams the generated response directly back into the guest's memory.
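The inference flow above might look roughly like this from the host's side. It is a hedged sketch, not the nanos implementation: `stubEngineTokens` is a hypothetical stand-in for the native `LlmEngine`, `hostLlmInfer` and its output-buffer parameters are assumed names, and a `Uint8Array` again simulates guest linear memory.

```typescript
// Simulated guest linear memory (assumption: one 64 KiB page).
const guestMem = new Uint8Array(64 * 1024);
const textEnc = new TextEncoder();
const textDec = new TextDecoder();

// Hypothetical stand-in for the native engine: returns the reply as a list of tokens.
function stubEngineTokens(prompt: string): string[] {
  return ("echo: " + prompt).split(" ").map(function (word) { return word + " "; });
}

// Host handler: read the prompt from guest memory, then write tokens into the
// guest's output buffer one at a time. Returns the total number of bytes written.
function hostLlmInfer(promptPtr: number, promptLen: number, outPtr: number, outCap: number): number {
  if (promptPtr < 0 || promptLen < 0 || promptPtr + promptLen > guestMem.length) return -1;
  if (outPtr < 0 || outCap < 0 || outPtr + outCap > guestMem.length) return -1;
  const prompt = textDec.decode(guestMem.subarray(promptPtr, promptPtr + promptLen));
  const tokens = stubEngineTokens(prompt);
  let written = 0;
  for (let i = 0; i < tokens.length; i++) {
    const bytes = textEnc.encode(tokens[i] || "");
    if (written + bytes.length > outCap) break; // never write past the guest's buffer
    guestMem.set(bytes, outPtr + written);
    written += bytes.length;
  }
  return written;
}

// Guest side: place the prompt in memory, trigger the call, decode the reply.
const promptBytes = textEnc.encode("summarize this");
guestMem.set(promptBytes, 0);
const replyLen = hostLlmInfer(0, promptBytes.length, 2048, 1024);
const reply = textDec.decode(guestMem.subarray(2048, 2048 + replyLen));
```

Writing token by token with a capacity check means a long generation is truncated at the buffer limit instead of overrunning guest memory.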
nanos isolates agent execution using WebAssembly linear memory sandboxes and explicit host capability permissions.
- Memory Isolation: The agent cannot access arbitrary host memory; it is confined to the WASM linear memory heap.
- Syscall Gatekeeping: The agent cannot invoke host syscalls directly. All requests must go through the FFI boundary.
- Explicit Whitelisting: Filesystem read/write and network access are disabled by default and must be explicitly whitelisted in the `.nano` manifest.
- Resource Constraints: Executes under strict fuel (instruction count) limits and physical memory caps to prevent infinite loop resource exhaustion.
nanos is designed to mitigate the following threats:
- Prompt-Injection-Driven File Access: If the agent is tricked by prompt injection into reading or writing system files, the host FFI gatekeeper blocks the request unless it matches the manifest's whitelist.
- Accidental System Modification: Bugs in agent code cannot modify files or execute arbitrary shell commands on the host.
- Runaway Agent Loops: Malicious or runaway loops are automatically halted when the guest runs out of allotted WASM fuel.
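A deny-by-default whitelist check of the kind described in the first mitigation could be sketched as follows. The matching rules here are assumptions made for illustration (`*` matches within a single path segment, absolute paths and `..` segments are always rejected); the actual nanos gatekeeper semantics are not specified in this document, and `globToRegExp` and `isPathAllowed` are hypothetical helper names.

```typescript
// Convert a simple glob into an anchored RegExp.
// Assumption: "*" matches any run of characters except "/".
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters other than "*", so "." in "a.txt" stays literal.
  const escaped = glob.replace(/[.+^${}()|[\]\\?]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, "[^/]*") + "$");
}

// Deny by default: a path is allowed only if it matches some whitelist pattern.
function isPathAllowed(path: string, whitelist: string[]): boolean {
  // Reject absolute paths and directory traversal regardless of the whitelist.
  if (path.charAt(0) === "/" || path.split("/").indexOf("..") !== -1) return false;
  return whitelist.some(function (pattern) {
    return globToRegExp(pattern).test(path);
  });
}

const fsReadRules = ["instruction.txt", "notes/*.txt"];
const okExact = isPathAllowed("instruction.txt", fsReadRules);   // true
const okGlob = isPathAllowed("notes/a.txt", fsReadRules);        // true
const blocked = isPathAllowed("Cargo.toml", fsReadRules);        // false
const traversal = isPathAllowed("notes/../Cargo.toml", fsReadRules); // false
```

Because the check runs on the host before any I/O, a prompt-injected request for an unlisted file fails in the same way as a buggy one.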
The following threats are out of scope:
- Malicious native host extensions or compromises of the host process itself.
- Kernel-level compromises of the host OS.
- Hardware-level side-channel attacks (e.g., Spectre, Meltdown).
- Unpatched zero-day vulnerabilities inside the underlying WASM compilation runtime (wasmtime).
We chose WebAssembly as the compilation target for agents because it provides the exact primitives needed to build a secure, lightweight microkernel layer:
- Deterministic Memory: WASM linear memory layouts are strictly bounded, ensuring the guest program cannot read or write outside its allocated heap.
- Instruction Metering (Fuel): The runtime can account for guest execution steps and halt execution once a configurable fuel limit is exceeded, preventing infinite loops.
- Portable Compilation: Write your agent in TypeScript, JavaScript, Rust, or Go; they compile down to standard portable WASM bytecode.
- Millisecond-Scale Boot: Instantiating a WASM module is essentially a host heap allocation rather than an operating system kernel boot (under 3 ms in the benchmark below).
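Instruction metering is the least familiar of these primitives, so here is a minimal sketch of the idea. It is a conceptual model only: real runtimes such as wasmtime meter compiled WASM instructions, whereas this hypothetical `runWithFuel` helper just charges one unit of fuel per call to a step function.

```typescript
type RunResult = { status: "done" | "out_of_fuel"; stepsRun: number };

// Run `step` until it reports completion (returns true) or the fuel budget is spent.
function runWithFuel(step: () => boolean, fuel: number): RunResult {
  let stepsRun = 0;
  while (stepsRun < fuel) {
    stepsRun++; // each step costs one unit of fuel in this simplified model
    if (step()) return { status: "done", stepsRun: stepsRun };
  }
  return { status: "out_of_fuel", stepsRun: stepsRun };
}

// A well-behaved guest that finishes on its third step.
let progress = 0;
const finiteRun = runWithFuel(function () {
  progress++;
  return progress >= 3;
}, 1000);

// A runaway guest that never finishes; the host halts it when the budget is exhausted.
const runawayRun = runWithFuel(function () {
  return false;
}, 1000);
```

The guest cannot opt out: the budget is enforced by the loop that drives it, which is the property that makes fuel useful against both buggy and malicious agents.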
Traditional container technologies (Docker, LXC, gVisor) isolate processes using kernel namespaces and control groups (cgroups). nanos differs in two ways:
- Different Goals: Containers isolate complete software stacks (operating system libraries, daemons, virtual network bridges). nanos isolates only the agent execution logic while keeping the heavy resources (weights, GPU memory) shared natively on the host.
- Hardware Barrier: On macOS and Windows workstations, containers run inside a hypervisor VM that blocks direct access to proprietary hardware interfaces (such as macOS Metal and the Apple Neural Engine), forcing inference onto the CPU. By separating the sandbox from inference, nanos retains full hardware speed.
To make performance assertions verifiable and auditable, we include a fully automated, reproducible benchmarking harness in the repository. This allows developers to test raw performance on their own local machines.
Our benchmark harness evaluates both CPU and GPU performance across key parameters:
- Generation Throughput: Measures token generation latency and speed (tokens generated per second of decoding).
- Prompt Evaluation Speed: Measures processing speed for initial context prompts (tokens evaluated per second).
- Sandbox Boot Latency: Measures cold-start boot time from binary loading to FFI readiness.
- Memory Footprint: Measures Resident Set Size (RSS) memory consumption on the host (excluding loaded LLM weights to isolate runtime overhead).
Measured on Apple M1 Pro (16 GB Unified Memory), qwen2.5-coder:0.5b:
| Metric | Host (Native GPU Acceleration) | Docker Container (Hypervisor CPU) | Comparison / Speedup |
|---|---|---|---|
| Generation Throughput | 155.47 tok/sec | 15.86 tok/sec | 9.80x faster |
| Prompt Eval Speed | 3173.58 tok/sec | 911.78 tok/sec | 3.48x faster |
| Sandbox Boot Latency | < 3 ms (WASM Instantiation) | ~1,500 - 5,000 ms (VM Container boot) | > 500x faster |
| System Memory Footprint | ~20 MB RAM (Wasmtime sandbox) | >= 2,000 MB RAM (Linux VM Hypervisor) | 100x lighter |
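The speedup column can be checked directly from the measured values in the table. The short sketch below recomputes the ratios; `speedup` is a hypothetical helper, and the boot and memory rows use the table's bounds (3 ms vs. 1,500 ms, 20 MB vs. 2,000 MB), so those two figures are limits, not exact measurements.

```typescript
// Ratio of two measurements, formatted to two decimals like the table.
function speedup(fast: number, slow: number): string {
  return (fast / slow).toFixed(2) + "x";
}

const generationSpeedup = speedup(155.47, 15.86);  // tok/sec, host vs. Docker
const promptSpeedup = speedup(3173.58, 911.78);    // tok/sec, host vs. Docker

// Bounds only: slowest WASM boot (3 ms) against fastest container boot (1,500 ms).
const bootSpeedupFloor = 1500 / 3;
// Bounds only: smallest VM footprint (2,000 MB) against the ~20 MB sandbox.
const memoryRatioFloor = 2000 / 20;
```

Here `generationSpeedup` evaluates to "9.80x" and `promptSpeedup` to "3.48x", matching the table, while the two floor values come out to 500 and 100.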
Anyone can reproduce and audit these metrics by executing:
bash benchmarks/run_benchmark.sh
This script automatically spins up an Ollama Docker container, downloads the qwen2.5-coder:0.5b model, runs the Python test script across multiple iterations to calculate averages, outputs a detailed Markdown report, and cleans up all Docker resources.
To foster technical trust, we outline the exact systems boundaries where these numbers apply:
- Where `nanos` is Superior: On developer workstations (macOS/Windows), Docker runs inside a virtual machine hypervisor which does not support GPU pass-through (Apple Metal or direct Windows DirectX) to guest containers. The models are restricted to CPU execution (making Docker ~10x slower). `nanos` executes natively on host hardware with zero VM overhead.
- Where They Tie: On CPU-only cloud instances (e.g. AWS EC2 with no GPUs), both systems run Ollama on CPU, yielding equivalent inference speeds.
- Where Docker is Equal/Superior: On Linux servers with dedicated NVIDIA GPUs where the container is executed with native GPU passthrough (`docker run --gpus all`), Docker achieves 100% native hardware speed. Additionally, for tasks involving continuous multi-megabyte binary data transfers across the FFI boundary, WASM memory copies can introduce small latencies compared to native process serialization.
nanos is packaged as a single, compiled binary that manages everything from local runs to multi-agent fleets and network services.
# General usage structure
nanos <COMMAND> [OPTIONS]
| Command | Description | Example Usage |
|---|---|---|
| `run` | Run a single AI agent from a `.nano` manifest | `nanos run examples/agent.nano` |
| `serve` | Serve the agent runtime background daemon and Visual Web Debugger over HTTP | `nanos serve --port 8080` |
| `orchestrate` | Orchestrate cooperative multi-agent fleets locally or as a TCP server | `nanos orchestrate examples/fleet.nano --network --port 9090` |
| `node` | Connect a remote fleet node client back to the distributed server orchestrator | `nanos node --connect 127.0.0.1:9090 --name writer` |
| `dashboard` | Launch the real-time TUI dashboard and Time-Travel debug console | `nanos dashboard examples/fleet.nano` |
| `bench` | Run a native FFI latency benchmark against the LLM model | `nanos bench examples/agent.nano` |
Ensure you have the following installed on your host:
- Rust & Cargo (MSRV 1.75+)
- Node.js (v18+ for compiling, v20+ for the JS sandbox runner)
- Ollama running locally. Pull the model before running:
ollama pull qwen2.5-coder:1.5b
Clone and compile the native runtime binary:
git clone https://github.com/PandiaJason/nanos && cd nanos
cargo build --release
Build the default Rust agent core into WebAssembly:
# Compile core agent to WASM target
cd nanos-core-agent && cargo build --target wasm32-unknown-unknown && cd ..
# Setup example file and execute
cp examples/instruction.txt .
./target/release/nanos run examples/agent.nano
Use the custom compiler toolchain and TypeScript SDK (`nanos-sdk`) to bundle your TS scripts into secure WebAssembly.
import { fs, llm, agent } from '../nanos-sdk/index.js';
export async function run() {
console.log("TS Agent started!");
const goal = await agent.getGoal();
const inputData = await fs.readFile("instruction.txt");
const response = await llm.infer(`Summarize code: ${inputData}`);
await fs.writeFile("secret.txt", response);
await agent.done("TS FFI Loop completed successfully.");
}
run().catch(err => {
console.error("TS Agent execution failed:", err);
process.exit(1);
});
# Compile TS to WASM
node nanos-sdk/bin/nanos-compile.js examples/test_agent.ts --out dist/test_agent.wasm --engine bundle
# Run under the sandbox manifest configuration
./target/release/nanos run examples/agent_js.nano
Expose nanos as an HTTP daemon and launch the premium visual dashboard companion:
./target/release/nanos serve --port 8080 --host 127.0.0.1
Open http://localhost:8080 in your browser. Inspect running statuses, step latencies, peak memory consumption, and click on any step to trigger a Time-Travel Divergent Replay!
To demonstrate nanos in action, the following pre-configured examples are provided in the examples/ directory:
- Host Capability Sandboxing: Attempts unauthorized filesystem reads (e.g., `Cargo.toml`) and unauthorized external network calls to verify sandbox isolation boundaries.
# Compile the TS agent to WASM
node nanos-sdk/bin/nanos-compile.js examples/security_violation.ts --out dist/security_violation.wasm --engine bundle
# Execute the agent and verify the host blocks the unauthorized FFI calls
./target/release/nanos run examples/security_violation.nano
- TypeScript MCP Tool Calls: Invokes external Model Context Protocol tools from TypeScript via the host's `mcp_call` FFI system call.
# Compile the TS agent to WASM
node nanos-sdk/bin/nanos-compile.js examples/mcp_server_caller.ts --out dist/mcp_server_caller.wasm --engine bundle
# Run the agent against the ping-server MCP server
./target/release/nanos run examples/mcp_server_caller.nano
- Multi-Agent Fleet Orchestration: Runs two cooperative agents (`researcher` and `writer`) communicating over a shared memory message bus or TCP network.
# Run locally via threads
./target/release/nanos orchestrate examples/fleet_orchestrator.nano
Every agent is defined by a .nano YAML configuration file:
name: "nanos-js-agent"              # Name of the agent instance
model:
  provider: "ollama"                # LLM Provider: 'ollama' | 'openai' | 'local' (native GGUF)
  model_name: "qwen2.5-coder:0.5b"  # Model name (for ollama/openai)
  path: "models/qwen.gguf"          # GGUF local model path (required for 'local' provider)
  context_window: 4096              # Context size limit
  api_url: "http://..."             # Custom API URL (optional)
  api_key: "sk-..."                 # Custom API Key (optional)
resources:
  memory: "512MB"                   # Sandbox physical RAM heap limit
  max_steps: 10                     # Maximum FFI syscall loop iterations allowed
permissions:
  fs_read:                          # Whitelist of files or glob patterns the agent can read
    - "instruction.txt"
  fs_write:                         # Whitelist of files or glob patterns the agent can write
    - "secret.txt"
  network: false                    # Disable or enable external TCP socket access
mcp_servers:                        # Whitelist of external Model Context Protocol stdio servers
  - name: "ping-server"
    command: "node"
    args:
      - "path/to/server.js"
tools:                              # List of tools permitted for the agent (e.g. fs_read, fs_write, mcp_call, done)
  - "fs_read"
  - "fs_write"
  - "mcp_call"
binary: "dist/test_agent.wasm"      # Target agent compilation binary
goal: "Extract the secret..."       # Mission statement of the agent
For the complete JSON-RPC FFI Protocol specification, see the FFI Specification Document and the low-level WASM Syscall ABI Document.
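To show how the `resources` section of the manifest might be interpreted, here is a minimal sketch that turns the `memory` string into a byte count and a WASM page count. The helper names (`parseMemoryLimit`, `toWasmPages`) are hypothetical, and treating `MB` as a binary unit (1 MB = 1024 × 1024 bytes) is an assumption; only the 64 KiB WASM page size comes from the WebAssembly specification.

```typescript
// Shape of the manifest's `resources` section, as shown in the example above.
interface NanoResources {
  memory: string;    // e.g. "512MB"
  max_steps: number; // maximum FFI syscall loop iterations
}

const WASM_PAGE_BYTES = 64 * 1024; // WebAssembly linear memory grows in 64 KiB pages

// Parse strings such as "512MB" into bytes. Assumption: units are binary multiples.
function parseMemoryLimit(text: string): number {
  const match = /^(\d+)\s*(KB|MB|GB)$/i.exec(text.trim());
  if (match === null) throw new Error("invalid memory limit: " + text);
  const amount = parseInt(match[1] || "0", 10);
  const unit = (match[2] || "").toUpperCase();
  const factor = unit === "KB" ? 1024 : unit === "MB" ? 1024 * 1024 : 1024 * 1024 * 1024;
  return amount * factor;
}

// Number of whole WASM pages needed to cover a byte limit.
function toWasmPages(bytes: number): number {
  return Math.ceil(bytes / WASM_PAGE_BYTES);
}

const exampleResources: NanoResources = { memory: "512MB", max_steps: 10 };
const limitBytes = parseMemoryLimit(exampleResources.memory);
const limitPages = toWasmPages(limitBytes);
```

Under these assumptions the example manifest's 512MB limit is 536,870,912 bytes, or 8,192 WASM pages.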
Unlike WebAssembly projects such as LlamaEdge or WasmEdge, which package the LLM itself into WASM and expose it as an HTTP server, nanos sandboxes only the agent logic and lets inference run natively on host silicon.
| Dimension | LlamaEdge / WasmEdge | nanos |
|---|---|---|
| Core Paradigm | "LLM-as-a-Service" (Web Server Model) | "Microkernel OS" (In-Process Syscall Model) |
| Interface Boundary | Localhost HTTP REST Sockets (JSON-RPC) | Memory Boundary (Direct FFI Pointer Passing) |
| Agent / LLM Relation | Agent runs on the host, querying the LLM running inside WasmEdge over HTTP. | Agent runs inside the WASM sandbox, calling the host LLM via in-process FFI. |
| Tool Execution Latency | ~348ms (TCP stack, serialization, loopback routing) | < 1ms (Zero-copy memory pointer sharing) |
| Target Use Case | Serving LLMs as isolated cloud web backends. | Executing local, secure, low-latency AI agents. |
- The Web Server Model (LlamaEdge):
+------------+                  +------------------+                    +-------------+
| Host Agent | --(HTTP/JSON)--> | LlamaEdge WASM   | --(WASI-NN API)--> | host GPU/C+ |
| (Py / JS)  | <-- (REST API) --| (HTTP Server)    |                    +-------------+
+------------+                  +------------------+
Every step of the agent's action loop requires network translation, JSON parsing, and HTTP overhead.
- The Microkernel Syscall Model (nanos):
+---------------------------------------------------------+
|                      NANOS PROCESS                      |
|                                                         |
|   +---------------------+                               |
|   | WASM Agent Sandbox  |  (User Space Agent)           |
|   +----------+----------+                               |
|              |                                          |
|              | In-Process FFI Pointer Pass (`llm_infer`)|
|              v                                          |
|   +---------------------+                               |
|   |      Rust Host      |  (Kernel Space Services)      |
|   |  (Metal/CUDA/Tool)  |                               |
|   +---------------------+                               |
+---------------------------------------------------------+
The agent logic is isolated in user space, but LLM inference and tool execution run in kernel space on native host bindings. The boundary is crossed in microseconds via direct pointer passing, completely bypassing loopback network stacks.
To remain transparent, nanos acknowledges the following boundaries in its current release:
- Trusted GPU Inference: GPU inference remains natively compiled host-side trusted code rather than sandboxed execution.
- Warmup Dependencies: Sandbox startup is instantaneous, but initial model loading latency is dictated by the backend native inference engine.
- Capability vs. Kernel Enforcement: Network isolation is governed by host-side permission checks rather than operating-system-level network namespace virtualization.
- External MCP Servers: Any configured external MCP stdio servers and native plugins belong to the trusted computing base.
| Feature | `nanos` | E2B | LangChain | Docker + Python |
|---|---|---|---|---|
| Cold Start | < 3ms | ~2s | ~3s | ~30s |
| RAM Overhead | ~39MB | ~200MB | ~500MB | ~450MB |
| Sandbox | WASM process-isolated | Cloud VM container | None | Host container |
| GPU Access | Direct Metal / CUDA | None | None | Manual configuration |
| Air-Gapped | Yes | No (Cloud only) | No | Partial |
| Binary Size | Single ~23MB binary | N/A | `pip install` | `docker pull` |
If you find this project valuable, please consider giving it a star on GitHub!

