Production LLM inference on the consumer GPUs you already own — at near-zero latency overhead vs. a single big GPU. A 70B model needs ~140 GB of VRAM; an RTX 4090 has 24 GB. ComputePool shards a model layer-wise across two (or more) consumer cards, streams hidden states peer-to-peer over Gensyn AXL, settles via x402 + Superfluid on 0G Galileo testnet, and orchestrates everything through KeeperHub workflows.
The per-token network hop adds single-digit milliseconds on top of the actual forward pass — negligible next to the GPU compute itself, because AXL keeps the hidden-state transport peer-to-peer with no orchestrator round-trip.
Live on 0G Galileo testnet · Built for ETHGlobal 2026 · Hits all three sponsor tracks: 0G, Gensyn (AXL), KeeperHub.
| | |
|---|---|
| Demo | `make build && make up` → http://localhost:8000 |
| Pitch deck | `frontend/app/pitch/page.tsx` (also live at `/pitch`) |
| End-to-end script | `scripts/e2e_demo.py` |
Three deep-dive READMEs — start here:
- **0G ($15,000)** — `sponsors/0g/README.md` — Pooled-GPU SDK behind a TEE-attested orchestrator · Superfluid live on Galileo (chainId 16602) · ERC-7857 PoolINFT as access primitive
- **Gensyn — AXL ($5,000)** — `sponsors/gensyn-axl/README.md` — First production layer-pipelined LLM inference over AXL · prebuilt NVIDIA + CPU images with the AXL binary inside · native Tailscale in the same compose stack
- **KeeperHub ($5,000)** — `sponsors/keeperhub/README.md` — Two upstream PRs (#1106 Superfluid plugin, #1105 Coalition plugin) · five workflows drive the full x402 + Superfluid payment lifecycle
A single GPU can't fit a real model. We turn N small GPUs into one virtual one:
- **Sharding:** `worker/model.py` loads only the assigned layer slice from a HuggingFace model. Entry holds `embed + layers[0:mid]`, exit holds `layers[mid:N] + lm_head`.
- **P2P transport:** `worker/axl_client.py` + `worker/framing.py` carry hidden-state tensors and sampled tokens across Gensyn AXL frames.
- **Token loop:** `worker/pipeline.py` — `entry_generate` and `exit_loop` drive one forward hop per token until EOS (sketched after this list). Network overhead per token is negligible vs. the actual GPU compute — the AXL hop is single-digit ms on a Tailscale mesh, while a single transformer forward pass on a consumer card is tens of ms; sharded throughput tracks single-GPU throughput within noise.
- **Payments:** `orchestrator/x402.py` gates `/pools/{n}/infer` with HTTP 402; `orchestrator/economics.py` opens the Superfluid stream once the voucher settles.
- **Coordination:** `orchestrator/keeperhub.py` drives five KeeperHub workflows for coalition activation, pool wiring, stream start/stop, and slashing.
- **Custody / iNFT:** `orchestrator/inft/` mints a per-pool ERC-7857 INFT on 0G with encrypted intelligence metadata.
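A minimal sketch of that split and token loop, assuming batch size 1, greedy sampling, and no KV cache (the real worker keeps one in RAM). Here `embed`, `entry_layers`, `exit_layers`, `norm`, and `lm_head` mirror a standard HuggingFace decoder; `send_frame` / `recv_frame` are hypothetical stand-ins for the AXL transport, not the actual `worker/pipeline.py` API:

```python
# Illustrative only: no KV cache, greedy sampling, batch size 1.
# send_frame / recv_frame are hypothetical stand-ins for the AXL
# transport in worker/axl_client.py + worker/framing.py.
import torch

def entry_generate(input_ids, embed, entry_layers, send_frame, recv_frame,
                   eos_id, max_new_tokens=64):
    tokens = input_ids                      # (1, seq)
    for _ in range(max_new_tokens):
        h = embed(tokens)                   # (1, seq, hidden)
        for layer in entry_layers:          # layers[0:mid]
            h = layer(h)
        send_frame(h.to(torch.bfloat16))    # one AXL hop: hidden states -> exit
        next_id = recv_frame()              # exit replies with one token id
        if next_id == eos_id:
            break
        tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)
    return tokens

def exit_loop(exit_layers, norm, lm_head, recv_frame, send_frame):
    while True:
        h = recv_frame()                    # hidden states from the entry shard
        for layer in exit_layers:           # layers[mid:N]
            h = layer(h)
        logits = lm_head(norm(h[:, -1, :])) # project only the last position
        send_frame(int(logits.argmax(-1)))  # sampled token travels back over AXL
```

Note the asymmetry: the full hidden-state tensor travels entry → exit once per token, but only a single token id travels back, which is why the per-token network cost stays in the single-digit-ms range.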
| Sponsor | What we shipped on top | Track relevance |
|---|---|---|
| 0G | First CREATE2-deployed verified Superfluid contracts on 0G Galileo · pooled-GPU SDK lets consumer cards qualify for 0G Compute together · orchestrator runs in TEE so 0G's signing flow stays intact · live ERC-7857 INFT per pool | Best Agent Framework / Tooling and Best Autonomous Agents / iNFT |
| Gensyn — AXL | First production deployment of layer-pipelined LLM inference over AXL. Multi-node by construction (entry node ⇄ exit node). Prebuilt NVIDIA + CPU Docker images = one-line deploy. Tailscale-native — zero exposed ports. | Best Application of AXL — depth, multi-node, real utility |
| KeeperHub | Five workflows that drive the full demo · upstream Superfluid plugin (#1106) · upstream Coalition plugin (#1105) — multi-party on-chain commitments with slashing · agents pay autonomously via x402 | Best Innovative Use and Best Integration (payments + framework) |
Each sponsor folder contains the full breakdown — what we built, where it lives in the code, and how to verify it.
```
.
├── orchestrator/      FastAPI control plane + payments + INFT + KeeperHub client
│   ├── api/           /infer, /v1/chat/completions (OpenAI-compat)
│   ├── x402.py        self-hosted x402 facilitator client
│   ├── economics.py   Superfluid pool + stream lifecycle
│   ├── keeperhub.py   KeeperHub MCP / JSON-RPC client
│   ├── inft/          ERC-7857 INFT mint + 0G-Storage encrypted metadata
│   └── tee/           SGX attestation + signer
├── worker/            P2P inference shard (entry or exit)
│   ├── model.py       SplitModel — loads only assigned layer slice
│   ├── axl_client.py  HTTP wrapper around the AXL daemon
│   ├── framing.py     binary wire format (header + raw bf16 tensor; sketched below)
│   └── pipeline.py    entry_generate / exit_loop token loop
├── frontend/          Next.js 16 / React 19 dashboard + pitch deck
├── facilitator/       x402 facilitator (relayer for transferWithAuthorization)
├── contracts/         Foundry — PoolINFT.sol, deploy scripts
├── keeperhub/         Five workflow JSON exports (re-importable)
├── docker/            Multi-stage Dockerfile (Go for AXL + Python for app)
├── scripts/           e2e_demo.py, register_0g_provider.py, run-remote-worker.sh
└── sponsors/          ⭐ Per-sponsor deep-dive READMEs (start here for judging)
```
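One detail worth calling out from the tree: `worker/framing.py`'s wire format is just a small header plus the raw bf16 payload. A hedged sketch of the idea (the magic bytes and header fields here are invented for illustration, not the real layout):

```python
# Illustrative framing for a bf16 hidden-state tensor: a fixed header
# (hypothetical magic + 3 x uint32 shape) followed by the raw bytes.
import struct
import torch

MAGIC = b"HSF1"  # hypothetical frame magic, not the real one

def pack(t: torch.Tensor) -> bytes:
    assert t.dtype == torch.bfloat16 and t.dim() == 3
    b, s, h = t.shape
    header = MAGIC + struct.pack("<III", b, s, h)
    # bf16 has no numpy dtype: reinterpret as int16 to get a raw byte view
    return header + t.contiguous().view(torch.int16).numpy().tobytes()

def unpack(buf: bytes) -> torch.Tensor:
    assert buf[:4] == MAGIC
    b, s, h = struct.unpack("<III", buf[4:16])
    raw = torch.frombuffer(bytearray(buf[16:]), dtype=torch.int16)
    return raw.view(torch.bfloat16).reshape(b, s, h)
```

Shipping raw bf16 halves the bytes per hop vs. fp32 and keeps serialization overhead to a fixed-size header.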
```sh
make build   # multi-stage build: Go (AXL daemon) + Python (worker + orchestrator)
make up      # start orchestrator + node-a + node-b
make ps      # confirm three containers Up
```

Open http://localhost:8000, register a user, copy the API key, restart the workers with `OWNER_API_KEY=<key>` so they self-register, then from the dashboard: create a pool → add both nodes → initialize with a model → load → infer.
For the full payments flow (x402 + Superfluid + KeeperHub):
```sh
cp .env.example .env   # fill 0G testnet keys + KH workflow IDs + Superfluid addresses
make build && make up
docker compose up -d facilitator
DEMO_PAYER_KEY=0x... python scripts/e2e_demo.py
```

Tear down: `make clean`.
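Once a pool is loaded, any OpenAI client can talk to the orchestrator. A minimal sketch, assuming an API key from `/auth/register`; until the x402 voucher settles, the gated endpoint will answer HTTP 402:

```python
# Minimal client sketch against the OpenAI-compatible endpoint.
# Auth is via the X-API-Key header (not a Bearer token), so the
# api_key value passed to the SDK is unused.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="unused",
    default_headers={"X-API-Key": "<your key>"},
)

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",   # the default worker model
    messages=[{"role": "user", "content": "Hello from a pooled GPU!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```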
| Model | Layers | Split (entry / exit) |
|---|---|---|
| `meta-llama/Llama-3.2-1B` | 16 | 0–7 / 8–15 |
| `meta-llama/Llama-3.2-3B` | 28 | 0–13 / 14–27 |
| `Qwen/Qwen2.5-3B-Instruct` | 36 | 0–17 / 18–35 |
| `Qwen/Qwen3-4B-Instruct-2507` | 36 | 0–17 / 18–35 |
To add a new model: append to `MODEL_LAYERS` in `orchestrator/app.py`.
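The exact shape of `MODEL_LAYERS` is internal to `orchestrator/app.py`; a hypothetical entry, consistent with the split table above (the tuple layout here is an assumption, so match the existing entries in the file):

```python
# Hypothetical structure: model id -> (total decoder layers,
# first layer owned by the exit shard). Not the real dict shape.
MODEL_LAYERS = {
    "meta-llama/Llama-3.2-1B":   (16, 8),   # entry 0-7, exit 8-15
    "Qwen/Qwen2.5-3B-Instruct":  (36, 18),  # entry 0-17, exit 18-35
    "mistralai/Mistral-7B-v0.3": (32, 16),  # example addition: split 0-15 / 16-31
}
```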
Worker container env (full table in `docs/`):

| Variable | Required | Notes |
|---|---|---|
| `NODE_ID`, `WORKER_URL`, `ORCHESTRATOR_URL`, `OWNER_API_KEY` | yes | self-registration |
| `MODEL_NAME` | no | default `Qwen/Qwen2.5-3B-Instruct` |
| `PEER_HOST` or `PEER_ADDR` | yes | the other worker (single-worker mode is not supported) |
| `AXL_API_URL` | no | default `http://localhost:9002` (in-container) |
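The required rows above are what a worker needs to self-register against the orchestrator; a hypothetical startup check mirroring the table (not the actual `worker/` code):

```python
# Hypothetical env validation matching the table above.
import os
import sys

missing = [v for v in ("NODE_ID", "WORKER_URL", "ORCHESTRATOR_URL", "OWNER_API_KEY")
           if not os.environ.get(v)]
if not (os.environ.get("PEER_HOST") or os.environ.get("PEER_ADDR")):
    missing.append("PEER_HOST or PEER_ADDR")
if missing:
    sys.exit(f"worker: missing required env: {', '.join(missing)}")

MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-3B-Instruct")
AXL_API_URL = os.environ.get("AXL_API_URL", "http://localhost:9002")
```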
Orchestrator reads `MONGODB_URI`, `MONGODB_DB`, plus 0G + Superfluid + x402 + KeeperHub env documented in `.env.example`.
All endpoints except `/`, `/health`, `/api/models`, `/auth/*` require `X-API-Key`.
| Method | Path | Purpose |
|---|---|---|
| GET | `/` | HTML dashboard |
| GET | `/api/state` | JSON cluster snapshot |
| POST | `/auth/register`, `/auth/login` | API-key issuance |
| POST | `/nodes/register` | (called by workers on startup) |
| * | `/pools/...` | full pool CRUD + initialize / load / unload / infer |
| POST | `/v1/chat/completions` | OpenAI-compatible streaming inference (gated by x402) |
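What "gated by x402" means on the wire, as a hedged sketch: the first request is answered with HTTP 402 plus payment requirements, and the client retries after settling a voucher. The request body, the requirements schema, and `make_payment` are assumptions here (the header name follows the public x402 convention); `scripts/e2e_demo.py` is the authoritative flow:

```python
# Sketch of the x402 handshake from the client side; not this
# server's exact schema. make_payment is hypothetical.
import requests

URL = "http://localhost:8000/pools/1/infer"
headers = {"X-API-Key": "<your key>"}
body = {"prompt": "hi"}                     # assumed body shape

resp = requests.post(URL, json=body, headers=headers)
if resp.status_code == 402:
    requirements = resp.json()              # price, asset, payTo, ...
    voucher = make_payment(requirements)    # hypothetical: sign transferWithAuthorization, settle via facilitator
    headers["X-PAYMENT"] = voucher          # payment payload, per the x402 convention
    resp = requests.post(URL, json=body, headers=headers)
print(resp.status_code)
```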
- Single in-flight generation per pool (no batching, no parallel requests in v1).
- KV cache lives in worker RAM; dropped on `unload` or container restart.
- Worker callbacks are unauthenticated — never expose worker ports publicly. AXL's TLS handshake protects the P2P link itself, but there is no peer allowlist yet.
- CPU-only by default — the bundled `torch` is the CPU wheel. Swap in the matching CUDA wheel in `worker/requirements.txt` for GPU.
- The AXL port is overridden to `7001` (away from the worker's `7000`); don't change it.
Architecture deep-dive: `docs/`
Sponsor judging packets: `sponsors/`