fix: replace ollama ExternalName with ClusterIP+Endpoints for Gateway API by bussyjd · Pull Request #228 · ObolNetwork/obol-stack

bussyjd · 2026-02-26T17:17:49Z

Summary

Replace the ollama ExternalName Service in the llm namespace with ClusterIP + Endpoints for Traefik Gateway API compatibility
ExternalName services are rejected as HTTPRoute backends by Traefik (type ExternalName is not supported)
Add ollamaHostIPForBackend() to resolve the host IP at obol stack init time
On macOS: resolves host.docker.internal via DNS
On Linux k3d: falls back to docker0 bridge interface IP (host.k3d.internal only resolves inside k3d's CoreDNS)
Remove Go template conditionals ({{- if x402Enabled }}) from helmfile.yaml: they break Helmfile v1, which requires a .gotmpl extension for templated files. The x402 ForwardAuth middleware is now always deployed (the verifier returns 200 for unmatched routes)

Tested on

macOS (Docker Desktop): host.docker.internal resolves to Docker Desktop gateway IP
Linux (Ubuntu 24.04, Docker 27.3.1, k3d): docker0 interface IP (172.17.0.1) used as fallback. Full inference verified: llmspy → ollama ClusterIP → Endpoints 172.17.0.1:11434 → host Ollama → qwen3:8b response ✓

Test plan

go test ./internal/stack/ -run TestOllamaHostIP — passes on both macOS and Linux
go test ./internal/stack/ -run TestDockerBridge — passes on Linux (skips on macOS)
Fresh obol stack init + obol stack up on Linux — ClusterIP+Endpoints with 172.17.0.1 deployed
kubectl run curl test: ollama.llm.svc.cluster.local:11434 returns "Ollama is running"
Full inference through llmspy → ollama ClusterIP → qwen3:8b responds correctly
Verify existing macOS clusters continue working after upgrade

… API
Traefik's Gateway API controller rejects ExternalName services as HTTPRoute backends, causing 500 errors after valid x402 payment (ForwardAuth passes but Traefik can't proxy to the backend).
Replace the ExternalName ollama service with a ClusterIP service paired with a manual Endpoints object. The endpoint IP is resolved at `obol stack init` time via a new {{OLLAMA_HOST_IP}} placeholder:
- k3s: 127.0.0.1 (already an IP, no resolution needed)
- k3d on macOS: net.LookupHost("host.docker.internal"), fallback 192.168.65.254
- k3d on Linux: net.LookupHost("host.k3d.internal"), fallback 127.0.0.1
The existing {{OLLAMA_HOST}} placeholder is preserved for backward compatibility with other consumers.

On Linux, host.k3d.internal only resolves inside k3d's CoreDNS, not on the host machine. ollamaHostIPForBackend() now falls back to the docker0 bridge interface IP (typically 172.17.0.1) which is reachable from all Docker containers regardless of their network.
Resolution strategy:
1. If already an IP (k3s), return as-is
2. Try DNS resolution (works on macOS Docker Desktop)
3. On Linux k3d, fall back to docker0 interface IP

host.docker.internal is only in Docker's DNS, not the macOS host's. PR #228 (ClusterIP+Endpoints) requires an IP at init time, which broke `obol stack init` on macOS. Add dockerResolveHost() that runs `docker run --rm alpine nslookup <hostname>` as a fallback between host-side DNS and the Linux docker0 bridge.

* Add pre-flight port check before cluster creation
When `obol stack up` creates a new cluster, k3d tries to bind host ports 80, 8080, 443, and 8443. If any are already in use, Docker fails with a cryptic error and rolls back the entire cluster. Add a `checkPortsAvailable()` pre-flight check that probes each required port with `net.Listen` before invoking k3d. On conflict, the error message lists the blocked port(s) and shows a `sudo lsof` command to identify the offending process.
* Track llmspy image releases via Renovate
Add custom regex manager to detect new ObolNetwork/llms releases and auto-bump the image tag in llm.yaml. Follows the same pattern used for obol-stack-front-end and OpenClaw version tracking.
* Replace hardcoded gpt-oss:120b-cloud with dynamic Ollama model detection
The default model gpt-oss:120b-cloud does not exist and caused OpenClaw to deploy with a non-functional model configuration. Instead, query the host's Ollama server for actually available models and use those in the overlay. When no models are pulled, deploy with an empty model list and guide users to `obol model setup` or `ollama pull`.
* Add obol-stack-dev skill, integration tests, and README updates
- Add `obol-stack-dev` skill with full reference docs for LLM smart-routing through llmspy (architecture, CLI wrappers, overlay generation, integration testing, troubleshooting)
- Add integration tests (`//go:build integration`) that deploy 3 OpenClaw instances through obol CLI verbs and validate inference through Ollama, Anthropic, and OpenAI via llmspy
- Expand README model providers section and add OpenClaw commands
* feat(enclave): add Secure Enclave key management package
Implements internal/enclave, a CGO bridge to Apple Security.framework providing hardware-backed P-256 key management for macOS Secure Enclave.
Key capabilities:
- NewKey/LoadKey: generate or retrieve SE-backed P-256 keys persisted in the macOS keychain (kSecAttrTokenIDSecureEnclave); falls back to an ephemeral in-process key when the binary lacks keychain entitlements (e.g. unsigned test binaries)
- Sign: ECDSA-SHA256 via SecKeyCreateSignature; the private key never leaves the Secure Enclave co-processor
- ECDH: raw shared-secret exchange via SecKeyCopyKeyExchangeResult
- Encrypt/Decrypt: ECIES using ephemeral ECDH + HKDF-SHA256 + AES-256-GCM
  Wire format: [1:version][65:ephPubKey][12:nonce][ciphertext+16:GCM-tag]
- CheckSIP: verify System Integrity Protection is active via sysctl kern.csr_active_config; treats absent sysctl (macOS 26/Apple Silicon) as SIP fully enabled (hardware-enforced)
Platform coverage:
- darwin + cgo: full Security.framework implementation
- all other platforms: stubs returning ErrNotSupported so the module builds cross-platform without conditional compilation at call sites
Tests cover: key generation, load, sign, ECIES round-trip, tamper detection, idempotent NewKey, and SIP check. TestLoadKey / TestNewKeyIdempotent skip gracefully when running as an unsigned binary.
* feat(inference): wire Secure Enclave into x402 gateway
Adds SE-backed request encryption to the inference gateway, closing parity with ecloud's JWE-encrypted deployment secrets, applied here at the per-request level rather than deploy-time only.
Changes:
- internal/inference/enclave_middleware.go
  New HTTP middleware (enclaveMiddleware) that:
  • Decrypts Content-Type: application/x-obol-encrypted request bodies using the SE private key (ECIES-P256-HKDF-SHA256-AES256GCM)
  • Reconstructs the request as plain application/json before proxying
  • If X-Obol-Reply-Pubkey header present, encrypts the upstream response back to the client's ephemeral key (end-to-end confidentiality)
  • Exposes handlePubkey() for GET /v1/enclave/pubkey
- internal/inference/gateway.go
  • New GatewayConfig.EnclaveTag field (empty = plaintext mode, backward compatible)
  • Registers GET /v1/enclave/pubkey when EnclaveTag is set
  • Stacks layers: upstream → SE decrypt → x402 payment → client (operator sees only that a paid request arrived, never its content)
- cmd/obol/inference.go
  • --enclave-tag / -e / $OBOL_ENCLAVE_TAG flag on obol inference serve
  • New obol inference pubkey <tag> subcommand: prints or JSON-dumps the SE public key, equivalent to `ecloud compute app info` for identity
- internal/inference/enclave_middleware_test.go
  Tests: pubkey JSON shape, encrypted response round-trip, plaintext passthrough, gateway construction with EnclaveTag.
* feat(inference): add deployment lifecycle commands (ecloud parity)
Implements a persistent inference deployment store and full lifecycle CLI mirroring ecloud's 'compute app' surface:
  ecloud compute app deploy → obol inference create / deploy
  ecloud compute app list → obol inference list
  ecloud compute app info → obol inference info
  ecloud compute app terminate → obol inference delete
  ecloud compute app info pubkey → obol inference pubkey
internal/inference/store.go:
- Deployment struct: name, enclave_tag, listen_addr, upstream_url, wallet_address, price_per_request, chain, facilitator_url, timestamps
- Store: Create (with defaults + force flag), Get, List, Update, Delete
- Persisted at ~/.config/obol/inference/<name>/config.json (mode 0600)
- EnclaveTag auto-derived: "com.obol.inference.<name>" if not set
cmd/obol/inference.go (rewrites inference.go):
  obol inference create <name>: register deployment config
  obol inference deploy <name>: create-or-update + start gateway
  obol inference list: tabular or JSON listing
  obol inference info <name>: config + SE pubkey (--json)
  obol inference delete <name>: remove config (--purge-key also removes SE key from keychain)
  obol inference pubkey <name>: resolve name → tag → SE pubkey
  obol inference serve: low-level inline gateway (no store)
All commands accept --json flag for machine-readable output.
* feat(inference): add cross-platform client SDK for SE gateway
Extract pure-Go ECIES (encrypt + deriveKey) from enclave_darwin.go into enclave/ecies.go so the encryption half is available without CGO or Darwin.
Add inference.Client, an http.RoundTripper that:
- Fetches and caches the gateway's SE public key from GET /v1/enclave/pubkey
- Transparently encrypts request bodies (ECIES) before forwarding
- Optionally attaches X-Obol-Reply-Pubkey for end-to-end encrypted responses
- Decrypts encrypted responses when EnableEncryptedReplies is active
Mirrors ecloud's encryptRSAOAEPAndAES256GCM client pattern but for live per-request encryption rather than deploy-time secret encryption.
* fix(inference): address P0/P1/P2 review findings
P0 - Duplicate flag panic on deploy/serve --help: --force moved to create-only; deploy uses deployFlags() only. --wallet duplicate in serve eliminated (deployFlags() already defines it).
P1 - Encrypted reply Content-Length mismatch: After encrypting upstream response, refresh Content-Length to encrypted body size and clear Content-Encoding/ETag before writing headers.
P1 - SIP not enforced at runtime: gateway.Start() now calls enclave.CheckSIP() before initialising enclaveMiddleware when EnclaveTag is set; refuses to start if SIP disabled.
P2 - applyFlags overwrites existing config with flag defaults: Switch from c.String(...) to c.IsSet(...) guard so only flags the user explicitly set are merged into the stored Deployment.
P2 - Shallow middleware test coverage: Replace placeholder tests with five real wrapper-path tests covering pubkey endpoint shape, encrypted-request decrypt, plaintext passthrough, encrypted-reply header refresh (Content-Length/Content-Encoding/ETag), and invalid reply pubkey rejection.
Add CLI regression tests (inference_test.go): deploy --help and serve --help no-panic checks, serve wallet-required guard, applyFlags explicit-only mutation invariant.
* feat(inference): add Apple Containerization VM mode + fix security doc claims
Container integration (apple/container v0.9.0):
- internal/inference/container.go: ContainerManager wraps `container` CLI to start/stop Ollama in an isolated Linux micro-VM; polls Ollama health endpoint before gateway accepts requests
- internal/inference/store.go: add VMMode, VMImage, VMCPUs, VMMemoryMB, VMHostPort fields to Deployment
- internal/inference/gateway.go: start ContainerManager on Start() when VMMode=true, override UpstreamURL to container's localhost-mapped port, stop container on Stop(); fix misleading operator-can't-read comment
- cmd/obol/inference.go: add --vm, --vm-image, --vm-cpus, --vm-memory, --vm-host-port flags; wire through applyFlags and runGateway
Doc fixes:
- plans/pitch-diagrams.md: correct Diagram 1 (transit encryption not operator-blind), Diagram 5 (SIP blocks external attackers not operator), Diagram 7 (competitive matrix: Phase 1.5a at [0.85,0.20] not [0.85,0.88])
* fix(inference): fix wallet flag parsing + support --name flag
Two issues fixed:
1. applyFlags used c.IsSet("wallet") which could return false even when --wallet was explicitly passed; changed to non-empty check for flags that have no meaningful empty default (wallet, enclave-tag).
2. urfave/cli v2 stops flag parsing at the first positional arg, so `deploy test-vm --wallet addr` silently ignored the wallet flag. Fixed by adding a --name/-n flag to deployFlags() as an alternative to the positional argument.
Users can now use either:
  obol inference deploy --wallet <addr> [flags] <name>
  obol inference deploy --name <name> --wallet <addr> [flags]
Added wallet validation before store.Create to prevent writing bad configs.
Tested end-to-end: VM mode container starts, Ollama becomes ready in ~2s (cached image), gateway serves /health 200 and /v1/chat/completions 402.
* feat(inference): stream container image pull progress
Previously `container run --detach` silently pulled the image inline, causing a 26-minute silent wait on first run with no user feedback. Now runs an explicit `container pull <image>` with stdout/stderr wired to the terminal before starting the container, so users see live download progress. On cache hit the pull completes in milliseconds.
* chore(deps): migrate urfave/cli v2.27.7 → v3.6.2
Breaking changes applied across all cmd/obol files:
- cli.App{} → cli.Command{} (top-level app is now a Command)
- All Action signatures: func(*cli.Context) error → func(context.Context, *cli.Command) error
- All Subcommands: → Commands:
- EnvVars: []string{...} → Sources: cli.EnvVars(...) (X402_WALLET, OBOL_ENCLAVE_TAG, CLOUDFLARE_*, LLM_API_KEY)
- cli.AppHelpTemplate → cli.RootCommandHelpTemplate
- app.Run(os.Args) → app.Run(context.Background(), os.Args)
- All c.XXX() accessor calls → cmd.XXX() (~70 occurrences)
- cmd.Int() now returns int64; added casts for VMCPUs, VMMemoryMB, VMHostPort, openclaw dashboard port
- Passthrough command local var renamed cmd → proc to avoid shadowing the *cli.Command action parameter
- inference_test.go: rewrote deployContext(); cli.NewContext removed in v3; new impl runs a real *cli.Command and captures parsed state
Removed v2 transitive deps: go-md2man, blackfriday, smetrics.
* chore: ignore plans/ directory (kept local, not for public repo)
* docs(claude): update CLAUDE.md for cli v3 migration + inference gateway
- Fix CLI framework reference: urfave/cli/v2 → v3
- Update passthrough command example to v3 Action signature (context.Context, *cli.Command)
- Fix go.mod dependency listing
- Expand inference command tree (create/deploy/list/info/delete/pubkey/serve)
- Add Inference Gateway section: architecture, deployment lifecycle, SE integration, VM mode, flag patterns
- Add inference/enclave key files to References
* feat(obolup): add Apple container CLI installation (VM inference support)
Adds install_container() that downloads and installs the signed pkg from github.com/apple/container releases. macOS-only, non-blocking (failure continues with a warning). Pins CONTAINER_VERSION=0.9.0. Enables 'obol inference deploy --vm' for running Ollama in an isolated Apple Containerization Linux micro-VM.
* test(inference): add Layer 2 gateway integration tests with mock facilitator
Extracts buildHandler() from Start() so tests can inject the handler into an httptest.Server without requiring a real network listener. Adds VerifyOnly to GatewayConfig to skip on-chain settlement in staging/test environments.
gateway_test.go implements a minimal mock facilitator (httptest.Server with /supported, /verify, /settle endpoints and atomic call counters) and covers:
- Health check (no payment required)
- Missing X-PAYMENT header → 402
- Valid payment → verify + settle → 200
- VerifyOnly=true → verify only, no settle → 200
- Facilitator rejects payment → 402, no settle
- Upstream down → verify passes, proxy fails → 502
- GET /v1/models without payment → 402
- GET /v1/models with payment → 200
* docs(plans): add phase-2b linux TEE plan + export context
* feat(tee): add Linux TEE scaffold with stub backend (Phase 2b Steps 1-3)
Introduce internal/tee/ package providing a hardware-agnostic TEE key and attestation API that mirrors the macOS Secure Enclave interface. The stub backend enables full integration testing on any platform without requiring TDX/SNP/Nitro hardware.
- internal/tee/: key management, ECIES decrypt, attestation reports, user_data binding (SHA256(pubkey||modelHash)), verification helpers
- Gateway: TEE vs SE key selection, GET /v1/attestation endpoint
- Store: TEEType + ModelHash fields on Deployment
- CLI: --tee and --model-hash flags on create/deploy/serve/info/pubkey
- Tests: 14 tee unit tests + 4 gateway TEE integration tests
* feat(tee): ground TEE backends with real attestation libraries
Replace TODO placeholders with real library calls for all three TEE backends, anchoring the code to actual APIs that compile and can be verified on hardware later.
Attest backends (behind build tags, not compiled by default):
- SNP: github.com/google/go-sev-guest/client: GetQuoteProvider() + GetRawQuote() via /dev/sev-guest or configfs-tsm
- TDX: github.com/google/go-tdx-guest/client: GetQuoteProvider() + GetRawQuote() via /dev/tdx-guest or configfs-tsm
- Nitro: github.com/hf/nsm: OpenDefaultSession() + Send(Attestation) via /dev/nsm with COSE_Sign1 attestation documents
Verify functions (no build tag, compiles everywhere):
- VerifySNP: go-sev-guest/verify + validate (VCEK cert chain, ECDSA-P384)
- VerifyTDX: go-tdx-guest/verify + validate (DCAP PCK chain, ECDSA-256)
- VerifyNitro: hf/nitrite (COSE/CBOR, AWS Nitro Root CA G1)
- ExtractUserData: auto-detects SNP (1184 bytes), TDX (v4 + 0x81), Nitro (CBOR tag 0xD2), and stub (JSON) formats
Tests: 22 passing (14 existing + 8 new verification surface tests)
* feat(tee): add CoCo pod spec + QEMU dev integration tests (Phase 2b Steps 8-9)
Add Confidential Containers (CoCo) support to inference templates and integration tests for QEMU dev mode verification on bare-metal k3s.
Pod templates:
- Conditional runtimeClassName on both Ollama and gateway Deployments
- TEE args/env vars passed to gateway container (--tee, --model-hash)
- TEE metadata in discovery ConfigMap for frontend visibility
- New values: teeRuntime, teeType, teeModelHash with CLI annotations
CoCo helper (internal/tee/coco.go):
- InstallCoCo/UninstallCoCo via Helm with k3s-specific flags
- CheckCoCo returns operator status, runtime classes, KVM availability
- ParseCoCoRuntime validates kata-qemu-coco-dev/snp/tdx runtime names
Integration tests (go:build integration):
- CoCo operator install verification
- RuntimeClass existence check
- Pod deployment with kata-qemu-coco-dev + kernel isolation proof
- Inference gateway attestation from inside CoCo VM
* docs(tee): add Phase 2b session transcript export
* feat(x402): add ForwardAuth verifier service for per-route micropayments
Standalone x402 payment verification service designed for Traefik ForwardAuth. Enables monetising any HTTP route (RPC, inference, etc.) via x402 micropayments without modifying backend services.
Components:
- internal/x402: config loading, route pattern matching (exact/prefix/glob), ForwardAuth handler reusing mark3labs/x402-go middleware, poll-based config watcher for hot-reload
- cmd/x402-verifier: standalone binary with signal handling + graceful shutdown
- x402.yaml: K8s resources (Namespace, ConfigMap, Secret, Deployment, Service)
* feat(x402): add ERC-8004 client, on-chain registration, and x402 payment gating
- Add internal/erc8004 package: Go client for ERC-8004 Identity Registry on Base Sepolia using bind.NewBoundContract (register, setAgentURI, setMetadata, getMetadata, tokenURI, wallet functions)
- ABI verified against canonical erc-8004-contracts R&D sources with all 3 register() overloads, agent wallet functions, and events (Registered, URIUpdated, MetadataSet)
- Types match ERC-8004 spec: AgentRegistration with image, supportedTrust; ServiceDef with version; OnChainReg with numeric agentId
- Add x402 CLI commands: obol x402 register/setup/status
- Add well-known endpoint on x402 verifier (/.well-known/agent-registration.json)
- Add conditional x402 Middleware CRD + ExtensionRef in infrastructure helmfile
- Add x402Enabled flag to inference network template (values + helmfile + gateway)
- Add go-ethereum v1.17.0 dependency
* test: add x402 and ERC-8004 unit test coverage
Add comprehensive unit tests for the x402 payment verification and ERC-8004 on-chain registration subsystems:
- x402 config loading, chain resolution, and facilitator URL validation
- x402 verifier ForwardAuth handler and route matching
- x402 config file watcher polling logic
- ERC-8004 ABI encoding/decoding roundtrips
- ERC-8004 client type serialization and agent registration structs
- x402 test plan document covering all verification scenarios
* security: fix injection, fail-open, key exposure, and wallet validation
Address 4 vulnerabilities found during security review:
HIGH - YAML/JSON injection in setup.go: Replace fmt.Sprintf string interpolation with json.Marshal/yaml.Marshal for all user-supplied values (wallet, chain, route configs).
MEDIUM - ForwardAuth fail-open: Change empty X-Forwarded-Uri from 200 (allow) to 403 (deny). Missing header signals misconfiguration or tampering; fail-closed is the safer default.
MEDIUM - Private key in process args: Add --private-key-file flag and deprecate --private-key. Key is no longer visible in ps output or shell history when using file or env var.
MEDIUM - No wallet address validation: Add ValidateWallet() using go-ethereum/common.IsHexAddress with explicit 0x prefix check. Applied at all entry points (CLI, setup, verifier).
* security: architecture hardening across inference subsystem
Address 8 findings from architecture review:
- Path traversal in store: add ValidateName() regex guard on deployment names in Create/Get/Delete (prevents ../escape)
- Standalone binaries wallet validation: add ValidateWallet() to x402-verifier and inference-gateway entry points
- Bounded response capture: cap responseCapture at 64 MiB to prevent OOM from unbounded upstream responses during encryption
- TEE/SE mutual exclusion: NewGateway() rejects configs with both TEEType and EnclaveTag set
- Container name sanitization: add sanitizeContainerName() stripping unsafe chars, lowercasing, and truncating to 63 chars
- Attestation error redaction: return generic error to client, log details server-side only
- HTTPS on facilitator URL: require HTTPS for facilitator URLs with loopback exemption for local dev/testing
- Unified chain support: inference-gateway uses shared ResolveChain() supporting all 6 chains instead of inline 2-chain switch
* refactor: rename CLI commands: inference→service, x402→monetize
Align CLI surface for workload-agnostic compute monetization:
- `obol inference` → `obol service`: the gateway serves any workload (inference, fine-tuning, indexing, RPC), not just inference. All subcommands renamed (create/deploy/serve/etc).
- `obol x402` → `obol monetize`: payment gating and on-chain registration are about monetization, not the x402 protocol specifically. Subcommand `setup` renamed to `pricing`.
Internal packages unchanged (internal/inference/, internal/x402/). This is a CLI-layer rename only.
* Add ServiceOffer CRD, monetize skill, and obol-agent singleton workflow
Implements CRD-driven compute monetization: ServiceOffer CR declares upstream services, pricing, and wallet; the obol-agent reconciles them through model pull, health check, ForwardAuth middleware, HTTPRoute, and optional ERC-8004 registration.
- ServiceOffer CRD (obol.network/v1alpha1) with status conditions
- openclaw-monetize ClusterRole/ClusterRoleBinding and admission policy
- monetize skill (SKILL.md + monetize.py reconciler + references)
- kube.py write helpers (api_post, api_patch, api_delete)
- Singleton obol-agent init with heartbeat injection
- CLI: obol monetize {offer,list,status,delete}
- Replace admin RoleBinding with scoped network Roles
- Remove busybox deployment from obol-agent.yaml
- Fix smoke test to use canonical skill names
* Clean up branch for public repo: remove sensitive files and competitor references
- Delete session transcripts (tee-linux.txt, plans/phase-2b-linux-tee.*)
- Remove all ecloud competitor references from service.go, client.go, enclave_middleware.go, store.go
- Fix stale obol inference → obol service naming in obolup.sh
- Fix x402.go → monetize.go reference in docs/x402-test-plan.md
* test: Phase 0 - static validation + test infrastructure
Adds unit tests validating embedded K8s manifests and CLI structure, plus shared test utilities for Anvil forks and mock x402 facilitator.
New files:
- internal/embed/embed_crd_test.go: CRD, RBAC, admission policy parsing
- cmd/obol/monetize_test.go: CLI command structure and required flags
- internal/testutil/anvil.go: Anvil fork helper (Base Sepolia)
- internal/testutil/facilitator.go: Mock x402 facilitator (httptest)
Modified:
- internal/embed/embed_skills_test.go: monetize.py syntax + kube.py helpers
* test: Phase 1 - CRD lifecycle integration tests
Adds 7 integration tests for ServiceOffer CRD CRUD operations:
- CRD exists in cluster
- Create/Get with field verification
- List across namespace
- Status subresource patch (conditions)
- Wallet regex validation rejection
- Printer columns (Model, Price, Ready, Age)
- Delete with 404 verification
Each test creates its own namespace (auto-cleaned up). Requires: running cluster with obol stack up.
* test: Phase 2 - RBAC + reconciliation integration tests
Adds 6 integration tests for monetize RBAC and reconciliation:
- ClusterRole exists with obol.network, traefik.io, gateway API groups
- ClusterRoleBinding has openclaw-* service account subjects
- monetize.py list runs without error from inside agent pod
- monetize.py process --all returns HEARTBEAT_OK with no offers
- process with non-existent upstream sets UpstreamHealthy=False
- process is idempotent (second run is no-op)
Requires: running cluster + obol-agent deployed.
* refactor: rename apiVersion obol.network -> obol.org across all files
CRD, RBAC, monetize skill, CLI, agent RBAC, docs, and tests all updated to use obol.org as the API group.
* test: Phase 3 - routing integration tests with Anvil upstream
Adds 7 integration tests for routing with Anvil:
- TestIntegration_Route_AnvilUpstream: Anvil RPC reachable from host
- TestIntegration_Route_FullReconcile: create→process→conditions
- TestIntegration_Route_MiddlewareCreated: ForwardAuth middleware exists
- TestIntegration_Route_HTTPRouteCreated: HTTPRoute with traefik-gateway
- TestIntegration_Route_TrafficRoutes: traffic routes through Traefik
- TestIntegration_Route_DeleteCascades: delete cascades cleanup
Adds helpers: requireAnvil, deployAnvilUpstream, serviceOfferWithAnvil, getConditionStatus, waitForCondition.
* test: Phase 4+5 - payment gate + full E2E integration tests
Phase 4 (Payment Gate):
- TestIntegration_PaymentGate_VerifierHealthy: verifier healthz/readyz
- TestIntegration_PaymentGate_402WithoutPayment: 402 without X-PAYMENT
- TestIntegration_PaymentGate_RequirementsFormat: 402 body has accepts array
- TestIntegration_PaymentGate_200WithPayment: 200 with valid X-PAYMENT
Phase 5 (Full E2E):
- TestIntegration_E2E_OfferLifecycle: CLI create→reconcile→pay→delete
- TestIntegration_E2E_HeartbeatReconciles: heartbeat auto-reconciles
- TestIntegration_E2E_ListAndStatus: monetize list + offer-status
Helpers: setupMockFacilitator (patches x402-verifier ConfigMap to use host-side httptest.Server via host.k3d.internal), addPricingRoute.
* feat: add x402 pricing route management and tunnel E2E tests
The monetize reconciler now autonomously manages x402-pricing ConfigMap routes during stage_payment_gate and cleanup on delete. Without this, the x402-verifier passed through all requests for free (200 for unmatched routes).
Changes:
- monetize.py: _add_pricing_route() and _remove_pricing_route() manage x402-pricing ConfigMap entries during reconciliation and deletion
- RBAC: add configmaps get/list/patch to openclaw-monetize ClusterRole
- Tests: TestIntegration_Tunnel_OllamaMonetized (full tunnel E2E with Ollama model + x402 + CF tunnel) and TestIntegration_Tunnel_AgentAutonomousMonetize (agent-driven lifecycle)
- RBAC unit test updated to verify configmaps permission
* test: Phase 7 - fork validation and agent skill iteration tests
Add integration tests for Anvil fork-based payment flows and agent error recovery scenarios. TestIntegration_Fork_FullPaymentFlow validates the complete 402→payment→200 cycle with a mock facilitator on a forked Base Sepolia. TestIntegration_Fork_AgentSkillIteration tests that the agent can recover from a bad upstream by fixing and re-processing.
* feat: align ServiceOffer schema with x402 and ERC-8004 standards
Rename CRD fields to match canonical x402/ERC-8004 wire formats:
- pricing → payment (with payTo, network, scheme, maxTimeoutSeconds)
- wallet → payment.payTo
- chain → payment.network
- register: bool → registration: object (with ERC-8004 services[], supportedTrust[])
- Add spec.type discriminator (inference, fine-tuning) with PriceTable
Add shared schemas package (internal/schemas/) as canonical source for ServiceOffer, PaymentTerms, RegistrationSpec types used by CRD, CLI, verifier, and reconciler.
Support per-route payTo/network overrides in x402 verifier RouteRule, enabling multiple ServiceOffers with different wallets/chains.
Update all tests, CLI flags, Python reconciler, and documentation.
* feat: add Ollama model pull/list commands and obolup improvements
Add `obol model pull` and `obol model list` CLI commands for managing Ollama models. Update obolup.sh with improved installation flow. Fix admission policy API group reference.
* test: add unit tests for schemas, x402 route options, verifier overrides, and CLI flags
Cover previously untested monetize lifecycle code:
- schemas/: EffectiveRequestPrice logic, JSON/YAML round-trips, field naming
- x402/setup: WithPayTo/WithNetwork route options, RouteRule serialization
- x402/verifier: per-route PayTo/Network overrides, invalid chain handling
- cmd/obol/monetize: flag existence, defaults, and required markers for all 8 subcommands
* fix: sell-side lifecycle blockers, e2e payment test, and test helper consolidation (#225)
* fix: sell-side lifecycle blockers and e2e payment test
Four blockers found and fixed during end-to-end sell-side walkthrough:
1. CRD/RBAC/admission resources gated by obolAgent.enabled=false (never deployed). Removed conditional guards from all 4 templates and the stale helmfile value; these resources are safe to deploy unconditionally.
2. x402-verifier container image not published. Added Dockerfile.x402-verifier (multi-stage: golang builder → distroless).
3. monetize.py hangs on /api/pull for large cached models. Added _ollama_model_exists() check via /api/tags before attempting slow pull.
4. host.docker.internal rejected by facilitator URL HTTPS validation. Added to the allow list alongside host.k3d.internal.
New integration test TestIntegration_PaymentGate_FullLifecycle verifies the complete flow: mock facilitator → patch ConfigMap → 402 without payment → 200 with payment → Ollama inference response.
* refactor: consolidate mock facilitator and ConfigMap injection helpers
Move duplicated test infrastructure from internal/x402/e2e_test.go into the shared internal/testutil package:
- Add platform detection (clusterHostURL) to testutil/facilitator.go so StartMockFacilitator uses host.docker.internal on macOS and host.k3d.internal on Linux, fixing the divergence between the two implementations.
- Extract ConfigMap patching, verifier restart, and cleanup into new testutil/verifier.go (PatchVerifierFacilitator), eliminating ~40 lines of boilerplate from the e2e test.
- Replace race-unsafe plain int32 counters in the old hostMockFacilitator with the existing atomic.Int32 fields on MockFacilitator.
- Remove startHostMockFacilitator, buildTestPaymentHeader, patchFacilitatorURL, restoreConfigMap, waitForVerifierReload, and the hostMockFacilitator type from e2e_test.go.
Net: -177 lines from e2e_test.go, +120 lines of reusable test helpers.
* fix: replace ollama ExternalName with ClusterIP+Endpoints and docker0 fallback (#228)
* fix: replace ollama ExternalName with ClusterIP+Endpoints for Gateway API
Traefik's Gateway API controller rejects ExternalName services as HTTPRoute backends, causing 500 errors after valid x402 payment (ForwardAuth passes but Traefik can't proxy to the backend).
Replace the ExternalName ollama service with a ClusterIP service paired with a manual Endpoints object. The endpoint IP is resolved at `obol stack init` time via a new {{OLLAMA_HOST_IP}} placeholder:
- k3s: 127.0.0.1 (already an IP, no resolution needed)
- k3d on macOS: net.LookupHost("host.docker.internal"), fallback 192.168.65.254
- k3d on Linux: net.LookupHost("host.k3d.internal"), fallback 127.0.0.1
The existing {{OLLAMA_HOST}} placeholder is preserved for backward compatibility with other consumers.
* fix: resolve Ollama host IP via docker0 fallback on Linux
On Linux, host.k3d.internal only resolves inside k3d's CoreDNS, not on the host machine. ollamaHostIPForBackend() now falls back to the docker0 bridge interface IP (typically 172.17.0.1) which is reachable from all Docker containers regardless of their network.
Resolution strategy:
1. If already an IP (k3s), return as-is
2. Try DNS resolution (works on macOS Docker Desktop)
3. On Linux k3d, fall back to docker0 interface IP
* ci: add x402-verifier Docker image build workflow (#226)
* fix: monetize healthPath default and dev skill documentation (#227)
Change upstream healthPath default from /health to / since Ollama responds with "Ollama is running" at / but returns 404 at /health.
Add quiet parameter to kube.py api_get to suppress noisy stderr output during existence checks (404s that are expected and handled).
Document sell-side monetize lifecycle in the dev skill including architecture, three-layer integration, testing commands, and gotchas.
* test: Phase 0 - static validation + test infrastructure (#219)
* test: Phase 0 - static validation + test infrastructure
Adds unit tests validating embedded K8s manifests and CLI structure, plus shared test utilities for Anvil forks and mock x402 facilitator.
New files:
- internal/embed/embed_crd_test.go: CRD, RBAC, admission policy parsing
- cmd/obol/monetize_test.go: CLI command structure and required flags
- internal/testutil/anvil.go: Anvil fork helper (Base Sepolia)
- internal/testutil/facilitator.go: Mock x402 facilitator (httptest)
Modified:
- internal/embed/embed_skills_test.go: monetize.py syntax + kube.py helpers
* refactor: rename apiVersion obol.network -> obol.org across all files
CRD, RBAC, monetize skill, CLI, agent RBAC, docs, and tests all updated to use obol.org as the API group.
* fix: resolve host.docker.internal via Docker container DNS
host.docker.internal is only in Docker's DNS, not the macOS host's. PR #228 (ClusterIP+Endpoints) requires an IP at init time, which broke `obol stack init` on macOS. Add dockerResolveHost() that runs `docker run --rm alpine nslookup <hostname>` as a fallback between host-side DNS and the Linux docker0 bridge.
* fix: replace dockerResolveHost with hardcoded Docker Desktop gateway
Spawning a container to resolve host.docker.internal is slow and fragile.
Use Docker Desktop's well-known VM gateway IP (192.168.65.254) directly as the macOS fallback. This IP is stable across Docker Desktop versions. * fix: integration test failures and remove nodecore-token-refresher Test fixes: - Replace resolveK3dHostIP() kubectl exec into distroless container with testutil.ClusterHostIP() (macOS: 192.168.65.254, Linux: docker0 bridge) - Fix CRD field names in ollamaServiceOfferYAML() and Fork_AgentSkillIteration (pricing/wallet → payment.payTo/price.perRequest) - Use port-forward for verifier health check (distroless has no wget/sh) - Add EndpointSlice propagation wait in skill iteration test Cleanup: - Remove nodecore-token-refresher CronJob (oauth-token.yaml) and Reloader annotations from eRPC values * feat: auto-build and import local Docker images during stack up Build images like x402-verifier from source and import them into the k3d cluster. This eliminates ImagePullBackOff errors when GHCR images haven't been published yet. Gracefully skips when Dockerfiles aren't present (production installs without source). * fix: prefer openclaw-obol-agent instance for monetize tests When multiple OpenClaw instances exist, the test helper agentNamespace() now prefers openclaw-obol-agent since that's the instance with monetize RBAC (patched by `obol agent init`). Fixes 403 errors on fresh clusters with both default and obol-agent instances. * fix: wait for EndpointSlice propagation in deployAnvilUpstream Add an active readiness check that polls the Anvil service from inside the cluster before proceeding. On Linux, docker0 bridge + DNS propagation can take longer than the previous static sleep. * fix: bind Anvil to 0.0.0.0 for Linux k3d cluster access On Linux, k3d containers reach the host via docker0 bridge IP (172.17.0.1), not localhost. Anvil was bound to 127.0.0.1, causing "Connection refused" from inside the cluster. Bind to 0.0.0.0 so it's reachable from any interface. 
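The host-address handling described in the commits above (return a literal IP unchanged, try host-side DNS, then fall back to the Docker Desktop gateway on macOS or the docker0 bridge on Linux) can be sketched roughly as follows. This is an illustration only, not the repository's `ollamaHostIPForBackend()` or `testutil.ClusterHostIP()`: the names `resolveHostIP` and `dockerBridgeIP`, the injectable `lookup` and `bridge` parameters, and the final 127.0.0.1 fallback are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"runtime"
)

// dockerDesktopGateway is the macOS fallback IP named in the commit message.
const dockerDesktopGateway = "192.168.65.254"

// dockerBridgeIP returns the first IPv4 address on the docker0 interface.
func dockerBridgeIP() (string, error) {
	iface, err := net.InterfaceByName("docker0")
	if err != nil {
		return "", err
	}
	addrs, err := iface.Addrs()
	if err != nil {
		return "", err
	}
	for _, a := range addrs {
		if ipnet, ok := a.(*net.IPNet); ok {
			if v4 := ipnet.IP.To4(); v4 != nil {
				return v4.String(), nil
			}
		}
	}
	return "", errors.New("docker0 has no IPv4 address")
}

// resolveHostIP (hypothetical name) applies the three-step strategy.
// lookup and bridge are injected so the logic can be tested without
// real DNS or a real docker0 interface.
func resolveHostIP(host, goos string,
	lookup func(string) ([]string, error),
	bridge func() (string, error)) string {
	// 1. Already an IP (the k3s case): return it unchanged.
	if net.ParseIP(host) != nil {
		return host
	}
	// 2. Host-side DNS resolution.
	if addrs, err := lookup(host); err == nil && len(addrs) > 0 {
		return addrs[0]
	}
	// 3. Platform-specific fallback.
	switch goos {
	case "darwin":
		return dockerDesktopGateway
	case "linux":
		if ip, err := bridge(); err == nil {
			return ip
		}
	}
	return "127.0.0.1" // assumed last resort, not taken from the source
}

func main() {
	ip := resolveHostIP("host.docker.internal", runtime.GOOS, net.LookupHost, dockerBridgeIP)
	fmt.Println("endpoint IP:", ip)
}
```

Whatever address comes out of this step only helps if the host service listens on it, which is why the Anvil and mock-facilitator commits bind to 0.0.0.0 rather than 127.0.0.1.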
* fix: bind mock facilitator to 0.0.0.0 for Linux k3d access Same issue as Anvil: the mock facilitator was bound to 127.0.0.1, unreachable from k3d containers via docker0 bridge on Linux. * fix: gate local image build behind OBOL_DEVELOPMENT mode buildAndImportLocalImages should only run in development mode, not during production obol stack up. Production users pull pre-built images from GHCR. * docs: add OBOL_DEVELOPMENT=true to integration test env setup The local image build during stack up is gated behind OBOL_DEVELOPMENT. Update CLAUDE.md, dev skill references, and SKILL.md constraints to include this env var in all integration test setup instructions. * feat: implement ERC-8004 on-chain agent registration via remote-signer Add full in-pod ERC-8004 registration to the monetize skill, enabling agents to register themselves on the Identity Registry (Base Sepolia) using their auto-provisioned remote-signer wallet. Phase 1a: Add Base Sepolia to eRPC with two public RPC upstreams (sepolia.base.org, publicnode.com) and network alias routing. Phase 1b-1c: Implement register(string) calldata encoding in pure Python stdlib (hardcoded selector, manual ABI encoding), with full sign→broadcast→receipt→parse flow via remote-signer + eRPC. Phase 1d: Update CLI to read ERC-8004 registration from CRD status (single source of truth) instead of disk-based store. Remove RegistrationRecord disk writes from `monetize register` command. 
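Phase 1b-1c above encodes `register(string)` calldata by hand. The skill does this in Python; the sketch below shows the same ABI layout in Go, to match the language of the rest of the repository. For a single dynamic `string` argument the calldata is the 4-byte selector, a 32-byte offset word (0x20), a 32-byte length word, and the UTF-8 bytes right-padded to a multiple of 32. The function name `encodeStringCall` is hypothetical, and the selector bytes in `main` are a placeholder, not the real `register(string)` selector.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// word returns n as a 32-byte big-endian ABI word.
func word(n uint64) []byte {
	w := make([]byte, 32)
	binary.BigEndian.PutUint64(w[24:], n)
	return w
}

// encodeStringCall builds calldata for a function taking one string argument.
// The caller supplies the 4-byte selector.
func encodeStringCall(selector [4]byte, s string) []byte {
	data := []byte(s)
	padded := (len(data) + 31) / 32 * 32 // round up to a whole number of words

	out := make([]byte, 0, 4+32+32+padded)
	out = append(out, selector[:]...)
	out = append(out, word(32)...)                // offset of the string data within the arguments
	out = append(out, word(uint64(len(data)))...) // string length in bytes
	out = append(out, data...)
	out = append(out, make([]byte, padded-len(data))...) // zero padding
	return out
}

func main() {
	// Placeholder selector for illustration only.
	placeholder := [4]byte{0xde, 0xad, 0xbe, 0xef}
	calldata := encodeStringCall(placeholder, "https://example.com/agent-registration.json")
	fmt.Println("0x" + hex.EncodeToString(calldata))
}
```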
* chore: remove dead erc8004 disk store (CRD status is source of truth) * feat: add `obol rpc` command with ChainList auto-population Adds a new `obol rpc` CLI command group for managing eRPC upstreams: - `rpc list` — reads eRPC ConfigMap and displays configured networks with their upstream endpoints - `rpc add <chain>` — fetches free public RPCs from ChainList API (chainlist.org/rpcs.json), filters for HTTPS-only and low-tracking endpoints, sorts by quality, and adds top N to eRPC ConfigMap - `rpc remove <chain>` — removes ChainList-sourced RPCs for a chain - `rpc status` — shows eRPC pod health and upstream counts per chain Supports both chain names (base, arbitrum, optimism) and numeric chain IDs (8453, 42161). ChainList fetcher is injectable for testing. New files: - cmd/obol/rpc.go — CLI wiring - cmd/obol/rpc_test.go — command structure tests - internal/network/chainlist.go — ChainList API client and filtering - internal/network/chainlist_test.go — unit tests with fixture data - internal/network/rpc.go — eRPC ConfigMap read/patch operations * feat: add agent discovery skill for ERC-8004 registry search Add a new `discovery` skill that enables OpenClaw agents to discover other AI agents registered on the ERC-8004 Identity Registry. This completes the buy-side of the agent marketplace — agents can now search for, inspect, and evaluate other agents' services on-chain. Skill contents: - SKILL.md: usage docs, supported chains, environment variables - scripts/discovery.py: pure Python stdlib CLI with four commands: - search: list recently registered agents via Registered events - agent: get agent details (tokenURI, owner, wallet) - uri: fetch and display the agent's registration JSON - count: total registered agents (totalSupply or event count) - references/erc8004-registry.md: contract addresses, function selectors, event signatures, agentURI JSON schema Supports 20+ chains via CREATE2 addresses (mainnet + testnet sets). 
All queries are read-only, routed through the in-cluster eRPC gateway. * feat: E2E monetize plumbing — facilitator URL, custom RPC, registration publishing Closes the gaps found during E2E testing of the full monetize flow (fresh cluster → ServiceOffer → 402 → paid inference → lifecycle). Changes: - Add --facilitator-url flag to `obol monetize pricing` (+ X402_FACILITATOR_URL env) so self-hosted facilitators are first-class, not a kubectl-patch afterthought - Add --endpoint flag to `obol rpc add` with AddCustomRPC() for injecting local Anvil forks or custom RPCs into eRPC without ChainList - Expand monetize RBAC: agent can now create/delete ConfigMaps, Services, Deployments (needed for agent-managed registration httpd) - Agent reconciler publishes ERC-8004 registration JSON: creates ConfigMap + busybox httpd Deployment + Service + HTTPRoute at /.well-known/ path, all with ownerReferences for automatic GC on ServiceOffer deletion - `monetize delete` now removes pricing routes and deactivates registration (sets active=false in ConfigMap) before deleting the CR - Extract removePricingRoute() helper (DRY: used by both stop and delete) - Add --register-image flag for ERC-8004 required `image` field - Add docs/guides/monetize-inference.md walkthrough guide * docs: refresh CLAUDE.md — trim bloat, fix drift, add monetize subsystem CLAUDE.md had drifted significantly (1385 lines) with stale content and missing documentation for the monetize/x402/ERC-8004 subsystem. 
Changes: - 1385 → 505 lines (64% reduction) - Fixed stale paths: internal/embed/defaults/ → internal/embed/infrastructure/ - Fixed stale function signature: Setup() now takes facilitatorURL param - Added full Monetize Subsystem section (data flow, CLI, CRD, ForwardAuth, agent reconciler, ERC-8004 registration, RBAC) - Added RPC Gateway Management section (obol rpc add/list/remove/status) - Updated CLI command tree to match actual main.go (monetize, rpc, service, agent) - Updated Embedded Infrastructure section with all 7 templates - Updated skill count: 21 → 23 (added monetize, discovery) - Trimmed verbose sections: obolup.sh internals, network install parser details, full directory trees, redundant examples - Kept testnet/facilitator operational details in guides/skills (not CLAUDE.md) * Prep for a change to upstream erpc * Drop the :4000 from the local erpc, its inconvenient * fix: harden monetize subsystem — RBAC split, URL validation, HA, kubectl extraction (#235) * fix: harden monetize subsystem — RBAC split, URL validation, kubectl extraction Address 11 review findings from plan-exit-review: 1. **RBAC refactor**: Split monolithic `openclaw-monetize` ClusterRole into `openclaw-monetize-read` (cluster-wide read-only) and `openclaw-monetize-workload` (cluster-wide mutate). Add scoped `openclaw-x402-pricing` Role in x402 namespace for pricing ConfigMap. Update `patchMonetizeBinding()` to patch all 3 bindings. 2. **Extract internal/kubectl**: Eliminate ~250 lines of duplicated kubectl path construction and cluster-presence checks across 8 consumer files into a single `internal/kubectl` package. 3. **Fix ValidateFacilitatorURL bypass**: Replace `strings.HasPrefix` with `url.Parse()` + exact hostname matching to prevent http://localhost-hacker.com bypass. 4. **Pre-compute per-route chains**: Resolve all chain configs at Verifier load time instead of per-request, catching invalid chains early and eliminating hot-path allocations. 5. 
**x402-verifier HA**: Bump replicas to 2, add PodDisruptionBudget (minAvailable: 1) to prevent fail-open during rolling updates. 6. **Agent init errors fatal**: Make patchMonetizeBinding and injectHeartbeatFile failures return errors instead of warnings. 7. **Input validation in monetize.py**: Add strict regex validation for route patterns, prices, addresses, and network names to prevent YAML injection. 8. **Health check retries**: Add 3-attempt retry with 2s backoff to `stage_upstream_healthy` for transient pod startup failures. 9. **Test coverage**: Add 16-case ValidateFacilitatorURL test (including bypass regression), kubectl package tests, RBAC document structure tests, and load-time chain rejection test. * fix: use kubectl.Output in x402 e2e test after kubectl extraction The hardening commit extracted duplicated kubectl helpers into internal/kubectl but missed updating the x402 e2e integration test, causing a build failure. Use kubectl.Output instead of the removed local kubectlOutput function. * Overhaul cli ux attempt 1 * Update cli arch * chore: bump llmspy to v3.0.38-obol.3 and remove stream_options monkey-patch Synced llmspy fork with upstream v3.0.38. All Obol-specific fixes (SSE tool_call passthrough, per-provider tool_call config, process_chat tools preservation) are now in the published image. This removes the runtime stream_options monkey-patch from the init container and the PYTHONPATH override that were needed for the old image. Also adds tool_call: false to the Ollama provider config so llmspy passes tool calls through to the client (OpenClaw) instead of attempting server-side execution. 
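Returning to item 3 of the hardening commit (#235) above: the bypass arises because a prefix check on the raw string accepts `http://localhost-hacker.com`. Parsing the URL and comparing the exact hostname closes that hole. The sketch below is not the repository's `ValidateFacilitatorURL`; the function name, the error messages, and the exact contents of the allow list are assumptions based on the hosts mentioned elsewhere in this log.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// localHosts is an assumed allow list of hosts that may use plain HTTP.
var localHosts = map[string]bool{
	"localhost":            true,
	"127.0.0.1":            true,
	"::1":                  true,
	"host.docker.internal": true,
	"host.k3d.internal":    true,
}

// validateFacilitatorURL requires HTTPS, except for an exact match
// against the local allow list.
func validateFacilitatorURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse facilitator URL: %w", err)
	}
	host := strings.ToLower(u.Hostname())
	if host == "" {
		return errors.New("facilitator URL has no host")
	}
	switch u.Scheme {
	case "https":
		return nil
	case "http":
		// Exact hostname match: "localhost-hacker.com" is not "localhost".
		if localHosts[host] {
			return nil
		}
		return fmt.Errorf("plain HTTP is only allowed for local hosts, got %q", host)
	default:
		return fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
}

func main() {
	for _, raw := range []string{
		"https://facilitator.example.com",
		"http://localhost:8402",
		"http://localhost-hacker.com",
	} {
		fmt.Printf("%-35s -> %v\n", raw, validateFacilitatorURL(raw))
	}
}
```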
* fix: preserve pricing routes in x402 Setup, fix EIP-712 USDC domain, add payment flow tests Key changes: - x402 Setup() now reads existing pricing config and preserves routes added by the ServiceOffer reconciler (was overwriting with empty array) - EIP-712 signer uses correct USDC domain name ("USDC" not "USD Coin") for Base Sepolia TransferWithAuthorization signatures - Add full payment flow integration tests (402 → EIP-712 sign → 200) - Add test utilities: Anvil fork helpers, real facilitator launcher, EIP-712 payment header signer - Remove standalone inference-gateway (replaced by obol service serve) - Tunnel agent discovery, openclaw monetize integration tests * docs: rewrite getting-started guide and update monetize guide from fresh install verification getting-started.md: Full rewrite covering the complete journey from install to monetized inference. Verified every command against a fresh cluster (vast-flounder). Adds agent deployment, LLM inference testing with tool calls, and links to monetize guide. monetize-inference.md: Multiple fixes from end-to-end verification: - Fix node count (1 not 4), pod counts (2 x402 replicas) - Fix model pulling (host Ollama, not in-cluster kubectl exec) - Add concrete 402 response JSON example - Fix EIP-712 domain name info (USDC not USD Coin) - Fix payment header name (X-PAYMENT not PAYMENT-SIGNATURE) - Fix facilitator config JSON format - Add USDC settlement verification section (cast balance checks) - Add Cloudflare tunnel payment verification section - Update troubleshooting for signature errors and pricing route issues * fix: sanitize user-controlled error in enclave middleware log Use %q instead of %v to escape control characters in the decrypt error, preventing log injection via crafted ciphertext (CodeQL go/log-injection #2658). 
* docs: rename obol monetize → obol sell across docs, tests, and skills
Update all CLI command references from the old names to the new:
- obol monetize offer → obol sell http
- obol monetize offer-status → obol sell status
- obol monetize list/stop/delete/pricing/register → obol sell ...
- obol service → obol sell inference
* fix: remove duplicate eRPC port name causing Service validation error
The strategicMergePatches block added a second port named "http" (port 80) which conflicted with the chart's existing "http" port (4000), causing `spec.ports[1].name: Duplicate value: "http"` on stack up. Remove the patches and update HTTPRoute backendRef to use port 4000 directly.
* fixes to merge
* Push updates, still concerned about model upgrade, seems stuck on gpt-oss
* Less broken, but llmspy + anthropic still broken
* Things close to stable
---------
Co-authored-by: bussyjd <jd@obol.tech>
Co-authored-by: bussyjd <silversurfer972@gmail.com>

* Add pre-flight port check before cluster creation When `obol stack up` creates a new cluster, k3d tries to bind host ports 80, 8080, 443, and 8443. If any are already in use, Docker fails with a cryptic error and rolls back the entire cluster. Add a `checkPortsAvailable()` pre-flight check that probes each required port with `net.Listen` before invoking k3d. On conflict, the error message lists the blocked port(s) and shows a `sudo lsof` command to identify the offending process. * Track llmspy image releases via Renovate Add custom regex manager to detect new ObolNetwork/llms releases and auto-bump the image tag in llm.yaml. Follows the same pattern used for obol-stack-front-end and OpenClaw version tracking. * Replace hardcoded gpt-oss:120b-cloud with dynamic Ollama model detection The default model gpt-oss:120b-cloud does not exist and caused OpenClaw to deploy with a non-functional model configuration. Instead, query the host's Ollama server for actually available models and use those in the overlay. When no models are pulled, deploy with an empty model list and guide users to `obol model setup` or `ollama pull`. * Add obol-stack-dev skill, integration tests, and README updates - Add `obol-stack-dev` skill with full reference docs for LLM smart-routing through llmspy (architecture, CLI wrappers, overlay generation, integration testing, troubleshooting) - Add integration tests (`//go:build integration`) that deploy 3 OpenClaw instances through obol CLI verbs and validate inference through Ollama, Anthropic, and OpenAI via llmspy - Expand README model providers section and add OpenClaw commands * feat(enclave): add Secure Enclave key management package Implements internal/enclave — a CGO bridge to Apple Security.framework providing hardware-backed P-256 key management for macOS Secure Enclave. 
Key capabilities: - NewKey/LoadKey: generate or retrieve SE-backed P-256 keys persisted in the macOS keychain (kSecAttrTokenIDSecureEnclave); falls back to an ephemeral in-process key when the binary lacks keychain entitlements (e.g. unsigned test binaries) - Sign: ECDSA-SHA256 via SecKeyCreateSignature — private key never leaves the Secure Enclave co-processor - ECDH: raw shared-secret exchange via SecKeyCopyKeyExchangeResult - Encrypt/Decrypt: ECIES using ephemeral ECDH + HKDF-SHA256 + AES-256-GCM Wire format: [1:version][65:ephPubKey][12:nonce][ciphertext+16:GCM-tag] - CheckSIP: verify System Integrity Protection is active via sysctl kern.csr_active_config; treats absent sysctl (macOS 26/Apple Silicon) as SIP fully enabled (hardware-enforced) Platform coverage: - darwin + cgo: full Security.framework implementation - all other platforms: stubs returning ErrNotSupported so the module builds cross-platform without conditional compilation at call sites Tests cover: key generation, load, sign, ECIES round-trip, tamper detection, idempotent NewKey, and SIP check. TestLoadKey / TestNewKeyIdempotent skip gracefully when running as an unsigned binary. * feat(inference): wire Secure Enclave into x402 gateway Adds SE-backed request encryption to the inference gateway, closing parity with ecloud's JWE-encrypted deployment secrets — applied here at the per-request level rather than deploy-time only. 
Changes: - internal/inference/enclave_middleware.go New HTTP middleware (enclaveMiddleware) that: • Decrypts Content-Type: application/x-obol-encrypted request bodies using the SE private key (ECIES-P256-HKDF-SHA256-AES256GCM) • Reconstructs the request as plain application/json before proxying • If X-Obol-Reply-Pubkey header present, encrypts the upstream response back to the client's ephemeral key (end-to-end confidentiality) • Exposes handlePubkey() for GET /v1/enclave/pubkey - internal/inference/gateway.go • New GatewayConfig.EnclaveTag field (empty = plaintext mode, backward compatible) • Registers GET /v1/enclave/pubkey when EnclaveTag is set • Stacks layers: upstream → SE decrypt → x402 payment → client (operator sees only that a paid request arrived, never its content) - cmd/obol/inference.go • --enclave-tag / -e / $OBOL_ENCLAVE_TAG flag on obol inference serve • New obol inference pubkey <tag> subcommand: prints or JSON-dumps the SE public key — equivalent to `ecloud compute app info` for identity - internal/inference/enclave_middleware_test.go Tests: pubkey JSON shape, encrypted response round-trip, plaintext passthrough, gateway construction with EnclaveTag. 
* feat(inference): add deployment lifecycle commands (ecloud parity) Implements a persistent inference deployment store and full lifecycle CLI mirroring ecloud's 'compute app' surface: ecloud compute app deploy → obol inference create / deploy ecloud compute app list → obol inference list ecloud compute app info → obol inference info ecloud compute app terminate → obol inference delete ecloud compute app info pubkey → obol inference pubkey internal/inference/store.go: - Deployment struct: name, enclave_tag, listen_addr, upstream_url, wallet_address, price_per_request, chain, facilitator_url, timestamps - Store: Create (with defaults + force flag), Get, List, Update, Delete - Persisted at ~/.config/obol/inference/<name>/config.json (mode 0600) - EnclaveTag auto-derived: "com.obol.inference.<name>" if not set cmd/obol/inference.go (rewrites inference.go): obol inference create <name> — register deployment config obol inference deploy <name> — create-or-update + start gateway obol inference list — tabular or JSON listing obol inference info <name> — config + SE pubkey (--json) obol inference delete <name> — remove config (--purge-key also removes SE key from keychain) obol inference pubkey <name> — resolve name → tag → SE pubkey obol inference serve — low-level inline gateway (no store) All commands accept --json flag for machine-readable output. * feat(inference): add cross-platform client SDK for SE gateway Extract pure-Go ECIES (encrypt + deriveKey) from enclave_darwin.go into enclave/ecies.go so the encryption half is available without CGO or Darwin. 
Add inference.Client — an http.RoundTripper that: - Fetches and caches the gateway's SE public key from GET /v1/enclave/pubkey - Transparently encrypts request bodies (ECIES) before forwarding - Optionally attaches X-Obol-Reply-Pubkey for end-to-end encrypted responses - Decrypts encrypted responses when EnableEncryptedReplies is active Mirrors ecloud's encryptRSAOAEPAndAES256GCM client pattern but for live per-request encryption rather than deploy-time secret encryption. * fix(inference): address P0/P1/P2 review findings P0 — Duplicate flag panic on deploy/serve --help: --force moved to create-only; deploy uses deployFlags() only. --wallet duplicate in serve eliminated (deployFlags() already defines it). P1 — Encrypted reply Content-Length mismatch: After encrypting upstream response, refresh Content-Length to encrypted body size and clear Content-Encoding/ETag before writing headers. P1 — SIP not enforced at runtime: gateway.Start() now calls enclave.CheckSIP() before initialising enclaveMiddleware when EnclaveTag is set; refuses to start if SIP disabled. P2 — applyFlags overwrites existing config with flag defaults: Switch from c.String(...) to c.IsSet(...) guard so only flags the user explicitly set are merged into the stored Deployment. P2 — Shallow middleware test coverage: Replace placeholder tests with five real wrapper-path tests covering pubkey endpoint shape, encrypted-request decrypt, plaintext passthrough, encrypted-reply header refresh (Content-Length/Content-Encoding/ETag), and invalid reply pubkey rejection. Add CLI regression tests (inference_test.go): deploy --help and serve --help no-panic checks, serve wallet-required guard, applyFlags explicit-only mutation invariant. 
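The P2 fix above (merge only the flags the user explicitly set, so defaults never overwrite stored config) is illustrated below with the standard library's `flag` package instead of urfave/cli, to keep the sketch dependency-free. `FlagSet.Visit` iterates only over flags that were set on the command line, which plays the same role as the `IsSet` guard. The `Deployment` fields, flag names, and default values are simplified stand-ins rather than the repository's definitions.

```go
package main

import (
	"flag"
	"fmt"
)

// Deployment is a cut-down stand-in for the stored deployment config.
type Deployment struct {
	Wallet string
	Price  string
	Chain  string
}

// applyFlags parses args and copies only explicitly set flags into d.
// Flags left at their defaults do not touch the stored values.
func applyFlags(d *Deployment, args []string) error {
	fs := flag.NewFlagSet("deploy", flag.ContinueOnError)
	wallet := fs.String("wallet", "", "payment recipient address")
	price := fs.String("price", "0.001", "price per request")
	chain := fs.String("chain", "base-sepolia", "payment chain")
	if err := fs.Parse(args); err != nil {
		return err
	}
	// Visit calls the function only for flags that were actually set.
	fs.Visit(func(f *flag.Flag) {
		switch f.Name {
		case "wallet":
			d.Wallet = *wallet
		case "price":
			d.Price = *price
		case "chain":
			d.Chain = *chain
		}
	})
	return nil
}

func main() {
	stored := &Deployment{Wallet: "0xabc", Price: "0.05", Chain: "base"}
	if err := applyFlags(stored, []string{"--price", "0.1"}); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *stored)
}
```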
* feat(inference): add Apple Containerization VM mode + fix security doc claims Container integration (apple/container v0.9.0): - internal/inference/container.go: ContainerManager wraps `container` CLI to start/stop Ollama in an isolated Linux micro-VM; polls Ollama health endpoint before gateway accepts requests - internal/inference/store.go: add VMMode, VMImage, VMCPUs, VMMemoryMB, VMHostPort fields to Deployment - internal/inference/gateway.go: start ContainerManager on Start() when VMMode=true, override UpstreamURL to container's localhost-mapped port, stop container on Stop(); fix misleading operator-can't-read comment - cmd/obol/inference.go: add --vm, --vm-image, --vm-cpus, --vm-memory, --vm-host-port flags; wire through applyFlags and runGateway Doc fixes: - plans/pitch-diagrams.md: correct Diagram 1 (transit encryption not operator-blind), Diagram 5 (SIP blocks external attackers not operator), Diagram 7 (competitive matrix: Phase 1.5a at [0.85,0.20] not [0.85,0.88]) * fix(inference): fix wallet flag parsing + support --name flag Two issues fixed: 1. applyFlags used c.IsSet("wallet") which could return false even when --wallet was explicitly passed; changed to non-empty check for flags that have no meaningful empty default (wallet, enclave-tag). 2. urfave/cli v2 stops flag parsing at the first positional arg, so `deploy test-vm --wallet addr` silently ignored the wallet flag. Fixed by adding a --name/-n flag to deployFlags() as an alternative to the positional argument. Users can now use either: obol inference deploy --wallet <addr> [flags] <name> obol inference deploy --name <name> --wallet <addr> [flags] Added wallet validation before store.Create to prevent writing bad configs. Tested end-to-end: VM mode container starts, Ollama becomes ready in ~2s (cached image), gateway serves /health 200 and /v1/chat/completions 402. 
* feat(inference): stream container image pull progress Previously `container run --detach` silently pulled the image inline, causing a 26-minute silent wait on first run with no user feedback. Now runs an explicit `container pull <image>` with stdout/stderr wired to the terminal before starting the container, so users see live download progress. On cache hit the pull completes in milliseconds. * chore(deps): migrate urfave/cli v2.27.7 → v3.6.2 Breaking changes applied across all cmd/obol files: - cli.App{} → cli.Command{} (top-level app is now a Command) - All Action signatures: func(*cli.Context) error → func(context.Context, *cli.Command) error - All Subcommands: → Commands: - EnvVars: []string{...} → Sources: cli.EnvVars(...) (X402_WALLET, OBOL_ENCLAVE_TAG, CLOUDFLARE_*, LLM_API_KEY) - cli.AppHelpTemplate → cli.RootCommandHelpTemplate - app.Run(os.Args) → app.Run(context.Background(), os.Args) - All c.XXX() accessor calls → cmd.XXX() (~70 occurrences) - cmd.Int() now returns int64; added casts for VMCPUs, VMMemoryMB, VMHostPort, openclaw dashboard port - Passthrough command local var renamed cmd → proc to avoid shadowing the *cli.Command action parameter - inference_test.go: rewrote deployContext() — cli.NewContext removed in v3; new impl runs a real *cli.Command and captures parsed state Removed v2 transitive deps: go-md2man, blackfriday, smetrics. 
* chore: ignore plans/ directory (kept local, not for public repo) * docs(claude): update CLAUDE.md for cli v3 migration + inference gateway - Fix CLI framework reference: urfave/cli/v2 → v3 - Update passthrough command example to v3 Action signature (context.Context, *cli.Command) - Fix go.mod dependency listing - Expand inference command tree (create/deploy/list/info/delete/pubkey/serve) - Add Inference Gateway section: architecture, deployment lifecycle, SE integration, VM mode, flag patterns - Add inference/enclave key files to References * feat(obolup): add Apple container CLI installation (VM inference support) Adds install_container() that downloads and installs the signed pkg from github.com/apple/container releases. macOS-only, non-blocking (failure continues with a warning). Pins CONTAINER_VERSION=0.9.0. Enables 'obol inference deploy --vm' for running Ollama in an isolated Apple Containerization Linux micro-VM. * test(inference): add Layer 2 gateway integration tests with mock facilitator Extracts buildHandler() from Start() so tests can inject the handler into an httptest.Server without requiring a real network listener. Adds VerifyOnly to GatewayConfig to skip on-chain settlement in staging/test environments. 
gateway_test.go implements a minimal mock facilitator (httptest.Server with /supported, /verify, /settle endpoints and atomic call counters) and covers: - Health check (no payment required) - Missing X-PAYMENT header → 402 - Valid payment → verify + settle → 200 - VerifyOnly=true → verify only, no settle → 200 - Facilitator rejects payment → 402, no settle - Upstream down → verify passes, proxy fails → 502 - GET /v1/models without payment → 402 - GET /v1/models with payment → 200 * docs(plans): add phase-2b linux TEE plan + export context * feat(tee): add Linux TEE scaffold with stub backend (Phase 2b Steps 1-3) Introduce internal/tee/ package providing a hardware-agnostic TEE key and attestation API that mirrors the macOS Secure Enclave interface. The stub backend enables full integration testing on any platform without requiring TDX/SNP/Nitro hardware. - internal/tee/: key management, ECIES decrypt, attestation reports, user_data binding (SHA256(pubkey||modelHash)), verification helpers - Gateway: TEE vs SE key selection, GET /v1/attestation endpoint - Store: TEEType + ModelHash fields on Deployment - CLI: --tee and --model-hash flags on create/deploy/serve/info/pubkey - Tests: 14 tee unit tests + 4 gateway TEE integration tests * feat(tee): ground TEE backends with real attestation libraries Replace TODO placeholders with real library calls for all three TEE backends, anchoring the code to actual APIs that compile and can be verified on hardware later. 
Attest backends (behind build tags, not compiled by default): - SNP: github.com/google/go-sev-guest/client — GetQuoteProvider() + GetRawQuote() via /dev/sev-guest or configfs-tsm - TDX: github.com/google/go-tdx-guest/client — GetQuoteProvider() + GetRawQuote() via /dev/tdx-guest or configfs-tsm - Nitro: github.com/hf/nsm — OpenDefaultSession() + Send(Attestation) via /dev/nsm with COSE_Sign1 attestation documents Verify functions (no build tag, compiles everywhere): - VerifySNP: go-sev-guest/verify + validate (VCEK cert chain, ECDSA-P384) - VerifyTDX: go-tdx-guest/verify + validate (DCAP PCK chain, ECDSA-256) - VerifyNitro: hf/nitrite (COSE/CBOR, AWS Nitro Root CA G1) - ExtractUserData: auto-detects SNP (1184 bytes), TDX (v4 + 0x81), Nitro (CBOR tag 0xD2), and stub (JSON) formats Tests: 22 passing (14 existing + 8 new verification surface tests) * feat(tee): add CoCo pod spec + QEMU dev integration tests (Phase 2b Steps 8-9) Add Confidential Containers (CoCo) support to inference templates and integration tests for QEMU dev mode verification on bare-metal k3s. 
Pod templates: - Conditional runtimeClassName on both Ollama and gateway Deployments - TEE args/env vars passed to gateway container (--tee, --model-hash) - TEE metadata in discovery ConfigMap for frontend visibility - New values: teeRuntime, teeType, teeModelHash with CLI annotations CoCo helper (internal/tee/coco.go): - InstallCoCo/UninstallCoCo via Helm with k3s-specific flags - CheckCoCo returns operator status, runtime classes, KVM availability - ParseCoCoRuntime validates kata-qemu-coco-dev/snp/tdx runtime names Integration tests (go:build integration): - CoCo operator install verification - RuntimeClass existence check - Pod deployment with kata-qemu-coco-dev + kernel isolation proof - Inference gateway attestation from inside CoCo VM * docs(tee): add Phase 2b session transcript export * feat(x402): add ForwardAuth verifier service for per-route micropayments Standalone x402 payment verification service designed for Traefik ForwardAuth. Enables monetising any HTTP route (RPC, inference, etc.) via x402 micropayments without modifying backend services. 
Components: - internal/x402: config loading, route pattern matching (exact/prefix/glob), ForwardAuth handler reusing mark3labs/x402-go middleware, poll-based config watcher for hot-reload - cmd/x402-verifier: standalone binary with signal handling + graceful shutdown - x402.yaml: K8s resources (Namespace, ConfigMap, Secret, Deployment, Service) * feat(x402): add ERC-8004 client, on-chain registration, and x402 payment gating - Add internal/erc8004 package: Go client for ERC-8004 Identity Registry on Base Sepolia using bind.NewBoundContract (register, setAgentURI, setMetadata, getMetadata, tokenURI, wallet functions) - ABI verified against canonical erc-8004-contracts R&D sources with all 3 register() overloads, agent wallet functions, and events (Registered, URIUpdated, MetadataSet) - Types match ERC-8004 spec: AgentRegistration with image, supportedTrust; ServiceDef with version; OnChainReg with numeric agentId - Add x402 CLI commands: obol x402 register/setup/status - Add well-known endpoint on x402 verifier (/.well-known/agent-registration.json) - Add conditional x402 Middleware CRD + ExtensionRef in infrastructure helmfile - Add x402Enabled flag to inference network template (values + helmfile + gateway) - Add go-ethereum v1.17.0 dependency * test: add x402 and ERC-8004 unit test coverage Add comprehensive unit tests for the x402 payment verification and ERC-8004 on-chain registration subsystems: - x402 config loading, chain resolution, and facilitator URL validation - x402 verifier ForwardAuth handler and route matching - x402 config file watcher polling logic - ERC-8004 ABI encoding/decoding roundtrips - ERC-8004 client type serialization and agent registration structs - x402 test plan document covering all verification scenarios * security: fix injection, fail-open, key exposure, and wallet validation Address 4 vulnerabilities found during security review: HIGH — YAML/JSON injection in setup.go: Replace fmt.Sprintf string interpolation with 
json.Marshal/yaml.Marshal for all user-supplied values (wallet, chain, route configs). MEDIUM — ForwardAuth fail-open: Change empty X-Forwarded-Uri from 200 (allow) to 403 (deny). Missing header signals misconfiguration or tampering; fail-closed is the safer default. MEDIUM — Private key in process args: Add --private-key-file flag and deprecate --private-key. Key is no longer visible in ps output or shell history when using file or env var. MEDIUM — No wallet address validation: Add ValidateWallet() using go-ethereum/common.IsHexAddress with explicit 0x prefix check. Applied at all entry points (CLI, setup, verifier). * security: architecture hardening across inference subsystem Address 8 findings from architecture review: - Path traversal in store: add ValidateName() regex guard on deployment names in Create/Get/Delete (prevents ../escape) - Standalone binaries wallet validation: add ValidateWallet() to x402-verifier and inference-gateway entry points - Bounded response capture: cap responseCapture at 64 MiB to prevent OOM from unbounded upstream responses during encryption - TEE/SE mutual exclusion: NewGateway() rejects configs with both TEEType and EnclaveTag set - Container name sanitization: add sanitizeContainerName() stripping unsafe chars, lowercasing, and truncating to 63 chars - Attestation error redaction: return generic error to client, log details server-side only - HTTPS on facilitator URL: require HTTPS for facilitator URLs with loopback exemption for local dev/testing - Unified chain support: inference-gateway uses shared ResolveChain() supporting all 6 chains instead of inline 2-chain switch * refactor: rename CLI commands — inference→service, x402→monetize Align CLI surface for workload-agnostic compute monetization: - `obol inference` → `obol service` — the gateway serves any workload (inference, fine-tuning, indexing, RPC), not just inference. All subcommands renamed (create/deploy/serve/etc). 
- `obol x402` → `obol monetize` — payment gating and on-chain registration are about monetization, not the x402 protocol specifically. Subcommand `setup` renamed to `pricing`. Internal packages unchanged (internal/inference/, internal/x402/). This is a CLI-layer rename only. * Add ServiceOffer CRD, monetize skill, and obol-agent singleton workflow Implements CRD-driven compute monetization: ServiceOffer CR declares upstream services, pricing, and wallet; the obol-agent reconciles them through model pull, health check, ForwardAuth middleware, HTTPRoute, and optional ERC-8004 registration. - ServiceOffer CRD (obol.network/v1alpha1) with status conditions - openclaw-monetize ClusterRole/ClusterRoleBinding and admission policy - monetize skill (SKILL.md + monetize.py reconciler + references) - kube.py write helpers (api_post, api_patch, api_delete) - Singleton obol-agent init with heartbeat injection - CLI: obol monetize {offer,list,status,delete} - Replace admin RoleBinding with scoped network Roles - Remove busybox deployment from obol-agent.yaml - Fix smoke test to use canonical skill names * Clean up branch for public repo: remove sensitive files and competitor references - Delete session transcripts (tee-linux.txt, plans/phase-2b-linux-tee.*) - Remove all ecloud competitor references from service.go, client.go, enclave_middleware.go, store.go - Fix stale obol inference → obol service naming in obolup.sh - Fix x402.go → monetize.go reference in docs/x402-test-plan.md * test: Phase 0 — static validation + test infrastructure Adds unit tests validating embedded K8s manifests and CLI structure, plus shared test utilities for Anvil forks and mock x402 facilitator. 
New files: - internal/embed/embed_crd_test.go: CRD, RBAC, admission policy parsing - cmd/obol/monetize_test.go: CLI command structure and required flags - internal/testutil/anvil.go: Anvil fork helper (Base Sepolia) - internal/testutil/facilitator.go: Mock x402 facilitator (httptest) Modified: - internal/embed/embed_skills_test.go: monetize.py syntax + kube.py helpers * test: Phase 1 — CRD lifecycle integration tests Adds 7 integration tests for ServiceOffer CRD CRUD operations: - CRD exists in cluster - Create/Get with field verification - List across namespace - Status subresource patch (conditions) - Wallet regex validation rejection - Printer columns (Model, Price, Ready, Age) - Delete with 404 verification Each test creates its own namespace (auto-cleaned up). Requires: running cluster with obol stack up. * test: Phase 2 — RBAC + reconciliation integration tests Adds 6 integration tests for monetize RBAC and reconciliation: - ClusterRole exists with obol.network, traefik.io, gateway API groups - ClusterRoleBinding has openclaw-* service account subjects - monetize.py list runs without error from inside agent pod - monetize.py process --all returns HEARTBEAT_OK with no offers - process with non-existent upstream sets UpstreamHealthy=False - process is idempotent (second run is no-op) Requires: running cluster + obol-agent deployed. * refactor: rename apiVersion obol.network -> obol.org across all files CRD, RBAC, monetize skill, CLI, agent RBAC, docs, and tests all updated to use obol.org as the API group. 
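Wallet validation comes up twice above: ValidateWallet() in the security commit and the wallet regex rejection in the Phase 1 CRD tests. The sketch below is a stdlib-only approximation, not the repository's code. The pattern and the names `walletPattern`, `looksLikeWallet`, and `sampleWallet` are assumptions for illustration; per the commit message, the real ValidateWallet() uses go-ethereum's common.IsHexAddress plus an explicit 0x prefix check.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// walletPattern is an assumed pattern: "0x" followed by exactly 40 hex digits.
var walletPattern = regexp.MustCompile(`^0x[0-9a-fA-F]{40}$`)

// sampleWallet is a syntactically valid placeholder address (not a real account).
var sampleWallet = "0x" + strings.Repeat("ab", 20)

// looksLikeWallet is a hypothetical helper that only checks the address syntax.
func looksLikeWallet(addr string) bool {
	return walletPattern.MatchString(addr)
}

func main() {
	candidates := []string{
		sampleWallet,
		sampleWallet[2:], // missing 0x prefix
		"0x1234",         // too short
	}
	for _, addr := range candidates {
		fmt.Printf("%q valid=%v\n", addr, looksLikeWallet(addr))
	}
}
```

A syntax check like this says nothing about checksum casing or whether the account exists; it only rejects obviously malformed input early.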
* test: Phase 3 — routing integration tests with Anvil upstream Adds 7 integration tests for routing with Anvil: - TestIntegration_Route_AnvilUpstream: Anvil RPC reachable from host - TestIntegration_Route_FullReconcile: create→process→conditions - TestIntegration_Route_MiddlewareCreated: ForwardAuth middleware exists - TestIntegration_Route_HTTPRouteCreated: HTTPRoute with traefik-gateway - TestIntegration_Route_TrafficRoutes: traffic routes through Traefik - TestIntegration_Route_DeleteCascades: delete cascades cleanup Adds helpers: requireAnvil, deployAnvilUpstream, serviceOfferWithAnvil, getConditionStatus, waitForCondition. * test: Phase 4+5 — payment gate + full E2E integration tests Phase 4 (Payment Gate): - TestIntegration_PaymentGate_VerifierHealthy: verifier healthz/readyz - TestIntegration_PaymentGate_402WithoutPayment: 402 without X-PAYMENT - TestIntegration_PaymentGate_RequirementsFormat: 402 body has accepts array - TestIntegration_PaymentGate_200WithPayment: 200 with valid X-PAYMENT Phase 5 (Full E2E): - TestIntegration_E2E_OfferLifecycle: CLI create→reconcile→pay→delete - TestIntegration_E2E_HeartbeatReconciles: heartbeat auto-reconciles - TestIntegration_E2E_ListAndStatus: monetize list + offer-status Helpers: setupMockFacilitator (patches x402-verifier ConfigMap to use host-side httptest.Server via host.k3d.internal), addPricingRoute. * feat: add x402 pricing route management and tunnel E2E tests The monetize reconciler now autonomously manages x402-pricing ConfigMap routes during stage_payment_gate and cleanup on delete. Without this, the x402-verifier passed through all requests for free (200 for unmatched routes). 
Changes: - monetize.py: _add_pricing_route() and _remove_pricing_route() manage x402-pricing ConfigMap entries during reconciliation and deletion - RBAC: add configmaps get/list/patch to openclaw-monetize ClusterRole - Tests: TestIntegration_Tunnel_OllamaMonetized (full tunnel E2E with Ollama model + x402 + CF tunnel) and TestIntegration_Tunnel_AgentAutonomousMonetize (agent-driven lifecycle) - RBAC unit test updated to verify configmaps permission * test: Phase 7 — fork validation and agent skill iteration tests Add integration tests for Anvil fork-based payment flows and agent error recovery scenarios. TestIntegration_Fork_FullPaymentFlow validates the complete 402→payment→200 cycle with a mock facilitator on a forked Base Sepolia. TestIntegration_Fork_AgentSkillIteration tests that the agent can recover from a bad upstream by fixing and re-processing. * feat: align ServiceOffer schema with x402 and ERC-8004 standards Rename CRD fields to match canonical x402/ERC-8004 wire formats: - pricing → payment (with payTo, network, scheme, maxTimeoutSeconds) - wallet → payment.payTo - chain → payment.network - register: bool → registration: object (with ERC-8004 services[], supportedTrust[]) - Add spec.type discriminator (inference, fine-tuning) with PriceTable Add shared schemas package (internal/schemas/) as canonical source for ServiceOffer, PaymentTerms, RegistrationSpec types used by CRD, CLI, verifier, and reconciler. Support per-route payTo/network overrides in x402 verifier RouteRule, enabling multiple ServiceOffers with different wallets/chains. Update all tests, CLI flags, Python reconciler, and documentation. * feat: add Ollama model pull/list commands and obolup improvements Add `obol model pull` and `obol model list` CLI commands for managing Ollama models. Update obolup.sh with improved installation flow. Fix admission policy API group reference. 
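The ForwardAuth behaviour described across the commits above (403 on an empty X-Forwarded-Uri, 200 for unmatched routes, 402 for priced routes without X-PAYMENT) can be summarised as a small decision function. This is a minimal sketch under stated assumptions, not the verifier's implementation: `decide` and `matches` are hypothetical names, routes are reduced to exact paths or prefixes ending in `/*`, and payment verification is reduced to "header present" where the real service verifies the payment through a facilitator.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// decide returns the status code a ForwardAuth verifier would hand back to Traefik.
func decide(forwardedURI string, hasPayment bool, pricedRoutes []string) int {
	if forwardedURI == "" {
		return http.StatusForbidden // fail closed: a missing header means misconfiguration or tampering
	}
	if !matches(forwardedURI, pricedRoutes) {
		return http.StatusOK // unmatched routes pass through for free
	}
	if !hasPayment {
		return http.StatusPaymentRequired
	}
	return http.StatusOK // assumption: the payment was verified elsewhere
}

// matches reports whether the URI path equals a pattern or falls under a "/*" prefix pattern.
func matches(uri string, patterns []string) bool {
	path := uri
	if i := strings.IndexByte(uri, '?'); i >= 0 {
		path = uri[:i] // ignore the query string
	}
	for _, p := range patterns {
		if strings.HasSuffix(p, "/*") {
			prefix := strings.TrimSuffix(p, "/*")
			if path == prefix || strings.HasPrefix(path, prefix+"/") {
				return true
			}
		} else if path == p {
			return true
		}
	}
	return false
}

func main() {
	routes := []string{"/services/llm/*"} // hypothetical priced route
	fmt.Println(decide("", false, routes))
	fmt.Println(decide("/free", false, routes))
	fmt.Println(decide("/services/llm/v1/chat", false, routes))
	fmt.Println(decide("/services/llm/v1/chat", true, routes))
}
```

The second case is why the pricing-route management above matters: until the reconciler adds a route to the x402-pricing ConfigMap, every request falls into the "unmatched, so free" branch.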
* test: add unit tests for schemas, x402 route options, verifier overrides, and CLI flags Cover previously untested monetize lifecycle code: - schemas/: EffectiveRequestPrice logic, JSON/YAML round-trips, field naming - x402/setup: WithPayTo/WithNetwork route options, RouteRule serialization - x402/verifier: per-route PayTo/Network overrides, invalid chain handling - cmd/obol/monetize: flag existence, defaults, and required markers for all 8 subcommands * fix: sell-side lifecycle blockers, e2e payment test, and test helper consolidation (#225) * fix: sell-side lifecycle blockers and e2e payment test Four blockers found and fixed during end-to-end sell-side walkthrough: 1. CRD/RBAC/admission resources gated by obolAgent.enabled=false (never deployed). Removed conditional guards from all 4 templates and the stale helmfile value — these resources are safe to deploy unconditionally. 2. x402-verifier container image not published. Added Dockerfile.x402-verifier (multi-stage: golang builder → distroless). 3. monetize.py hangs on /api/pull for large cached models. Added _ollama_model_exists() check via /api/tags before attempting slow pull. 4. host.docker.internal rejected by facilitator URL HTTPS validation. Added to the allow list alongside host.k3d.internal. New integration test TestIntegration_PaymentGate_FullLifecycle verifies the complete flow: mock facilitator → patch ConfigMap → 402 without payment → 200 with payment → Ollama inference response. * refactor: consolidate mock facilitator and ConfigMap injection helpers Move duplicated test infrastructure from internal/x402/e2e_test.go into the shared internal/testutil package: - Add platform detection (clusterHostURL) to testutil/facilitator.go so StartMockFacilitator uses host.docker.internal on macOS and host.k3d.internal on Linux, fixing the divergence between the two implementations. 
- Extract ConfigMap patching, verifier restart, and cleanup into new testutil/verifier.go (PatchVerifierFacilitator), eliminating ~40 lines of boilerplate from the e2e test.
- Replace race-unsafe plain int32 counters in the old hostMockFacilitator with the existing atomic.Int32 fields on MockFacilitator.
- Remove startHostMockFacilitator, buildTestPaymentHeader, patchFacilitatorURL, restoreConfigMap, waitForVerifierReload, and the hostMockFacilitator type from e2e_test.go.

Net: -177 lines from e2e_test.go, +120 lines of reusable test helpers.

* fix: replace ollama ExternalName with ClusterIP+Endpoints and docker0 fallback (#228)

* fix: replace ollama ExternalName with ClusterIP+Endpoints for Gateway API

Traefik's Gateway API controller rejects ExternalName services as HTTPRoute backends, causing 500 errors after valid x402 payment (ForwardAuth passes but Traefik can't proxy to the backend).

Replace the ExternalName ollama service with a ClusterIP service paired with a manual Endpoints object. The endpoint IP is resolved at `obol stack init` time via a new {{OLLAMA_HOST_IP}} placeholder:
- k3s: 127.0.0.1 (already an IP, no resolution needed)
- k3d on macOS: net.LookupHost("host.docker.internal"), fallback 192.168.65.254
- k3d on Linux: net.LookupHost("host.k3d.internal"), fallback 127.0.0.1

The existing {{OLLAMA_HOST}} placeholder is preserved for backward compatibility with other consumers.

* fix: resolve Ollama host IP via docker0 fallback on Linux

On Linux, host.k3d.internal only resolves inside k3d's CoreDNS, not on the host machine. ollamaHostIPForBackend() now falls back to the docker0 bridge interface IP (typically 172.17.0.1) which is reachable from all Docker containers regardless of their network.

Resolution strategy:
1. If already an IP (k3s), return as-is
2. Try DNS resolution (works on macOS Docker Desktop)
3. On Linux k3d, fall back to docker0 interface IP

* ci: add x402-verifier Docker image build workflow (#226)

* fix: monetize healthPath default and dev skill documentation (#227)

Change upstream healthPath default from /health to / since Ollama responds with "Ollama is running" at / but returns 404 at /health.

Add quiet parameter to kube.py api_get to suppress noisy stderr output during existence checks (404s that are expected and handled).

Document sell-side monetize lifecycle in the dev skill including architecture, three-layer integration, testing commands, and gotchas.

* test: Phase 0 — static validation + test infrastructure (#219)

* test: Phase 0 — static validation + test infrastructure

Adds unit tests validating embedded K8s manifests and CLI structure, plus shared test utilities for Anvil forks and mock x402 facilitator.

New files:
- internal/embed/embed_crd_test.go: CRD, RBAC, admission policy parsing
- cmd/obol/monetize_test.go: CLI command structure and required flags
- internal/testutil/anvil.go: Anvil fork helper (Base Sepolia)
- internal/testutil/facilitator.go: Mock x402 facilitator (httptest)

Modified:
- internal/embed/embed_skills_test.go: monetize.py syntax + kube.py helpers

* refactor: rename apiVersion obol.network -> obol.org across all files

CRD, RBAC, monetize skill, CLI, agent RBAC, docs, and tests all updated to use obol.org as the API group.

* fix: resolve host.docker.internal via Docker container DNS

host.docker.internal is only in Docker's DNS, not the macOS host's. PR #228 (ClusterIP+Endpoints) requires an IP at init time, which broke `obol stack init` on macOS. Add dockerResolveHost() that runs `docker run --rm alpine nslookup <hostname>` as a fallback between host-side DNS and the Linux docker0 bridge.

* fix: replace dockerResolveHost with hardcoded Docker Desktop gateway

Spawning a container to resolve host.docker.internal is slow and fragile.
Use Docker Desktop's well-known VM gateway IP (192.168.65.254) directly as the macOS fallback. This IP is stable across Docker Desktop versions. * fix: integration test failures and remove nodecore-token-refresher Test fixes: - Replace resolveK3dHostIP() kubectl exec into distroless container with testutil.ClusterHostIP() (macOS: 192.168.65.254, Linux: docker0 bridge) - Fix CRD field names in ollamaServiceOfferYAML() and Fork_AgentSkillIteration (pricing/wallet → payment.payTo/price.perRequest) - Use port-forward for verifier health check (distroless has no wget/sh) - Add EndpointSlice propagation wait in skill iteration test Cleanup: - Remove nodecore-token-refresher CronJob (oauth-token.yaml) and Reloader annotations from eRPC values * feat: auto-build and import local Docker images during stack up Build images like x402-verifier from source and import them into the k3d cluster. This eliminates ImagePullBackOff errors when GHCR images haven't been published yet. Gracefully skips when Dockerfiles aren't present (production installs without source). * fix: prefer openclaw-obol-agent instance for monetize tests When multiple OpenClaw instances exist, the test helper agentNamespace() now prefers openclaw-obol-agent since that's the instance with monetize RBAC (patched by `obol agent init`). Fixes 403 errors on fresh clusters with both default and obol-agent instances. * fix: wait for EndpointSlice propagation in deployAnvilUpstream Add an active readiness check that polls the Anvil service from inside the cluster before proceeding. On Linux, docker0 bridge + DNS propagation can take longer than the previous static sleep. * fix: bind Anvil to 0.0.0.0 for Linux k3d cluster access On Linux, k3d containers reach the host via docker0 bridge IP (172.17.0.1), not localhost. Anvil was bound to 127.0.0.1, causing "Connection refused" from inside the cluster. Bind to 0.0.0.0 so it's reachable from any interface. 
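The host IP resolution that the #228 commits and the follow-up fixes above converge on (literal IP, then DNS, then a platform fallback) can be sketched as follows. This is an illustration under stated assumptions, not the body of ollamaHostIPForBackend(): `resolveHostIP` and `docker0IP` are hypothetical names, and the lookup and bridge functions are injected so the logic can be exercised without DNS or a docker0 interface. The 192.168.65.254 fallback and the docker0 bridge come from the commit messages.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// dockerDesktopGatewayIP is the macOS fallback named in the commit messages.
const dockerDesktopGatewayIP = "192.168.65.254"

// resolveHostIP tries, in order: literal IP, DNS lookup, platform-specific fallback.
func resolveHostIP(host, goos string,
	lookup func(string) ([]string, error),
	bridgeIP func() (string, error)) (string, error) {

	if net.ParseIP(host) != nil {
		return host, nil // already an IP (the k3s case)
	}
	if addrs, err := lookup(host); err == nil && len(addrs) > 0 {
		return addrs[0], nil
	}
	switch goos {
	case "darwin":
		return dockerDesktopGatewayIP, nil
	case "linux":
		return bridgeIP()
	}
	return "", errors.New("cannot resolve " + host)
}

// docker0IP returns the first IPv4 address assigned to the docker0 interface.
func docker0IP() (string, error) {
	iface, err := net.InterfaceByName("docker0")
	if err != nil {
		return "", err
	}
	addrs, err := iface.Addrs()
	if err != nil {
		return "", err
	}
	for _, a := range addrs {
		if ipnet, ok := a.(*net.IPNet); ok && ipnet.IP.To4() != nil {
			return ipnet.IP.String(), nil
		}
	}
	return "", errors.New("docker0 has no IPv4 address")
}

func main() {
	// A failing lookup stands in for a host where the name does not resolve.
	noDNS := func(string) ([]string, error) { return nil, errors.New("no such host") }
	ip, err := resolveHostIP("host.k3d.internal", "linux", noDNS, docker0IP)
	fmt.Println(ip, err) // the docker0 address on a Linux Docker host, otherwise an error
}
```

In real use the caller would pass net.LookupHost and runtime.GOOS. Services on the host that the cluster reaches through this address also need to listen on more than 127.0.0.1, which is what the Anvil bind fix above addresses.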
* fix: bind mock facilitator to 0.0.0.0 for Linux k3d access Same issue as Anvil: the mock facilitator was bound to 127.0.0.1, unreachable from k3d containers via docker0 bridge on Linux. * fix: gate local image build behind OBOL_DEVELOPMENT mode buildAndImportLocalImages should only run in development mode, not during production obol stack up. Production users pull pre-built images from GHCR. * docs: add OBOL_DEVELOPMENT=true to integration test env setup The local image build during stack up is gated behind OBOL_DEVELOPMENT. Update CLAUDE.md, dev skill references, and SKILL.md constraints to include this env var in all integration test setup instructions. * feat: implement ERC-8004 on-chain agent registration via remote-signer Add full in-pod ERC-8004 registration to the monetize skill, enabling agents to register themselves on the Identity Registry (Base Sepolia) using their auto-provisioned remote-signer wallet. Phase 1a: Add Base Sepolia to eRPC with two public RPC upstreams (sepolia.base.org, publicnode.com) and network alias routing. Phase 1b-1c: Implement register(string) calldata encoding in pure Python stdlib (hardcoded selector, manual ABI encoding), with full sign→broadcast→receipt→parse flow via remote-signer + eRPC. Phase 1d: Update CLI to read ERC-8004 registration from CRD status (single source of truth) instead of disk-based store. Remove RegistrationRecord disk writes from `monetize register` command. 
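The manual ABI encoding mentioned in Phase 1b-1c above (the source does it in pure Python) follows the standard layout for a single dynamic string argument: selector, offset word, length word, then right-padded data. The Go sketch below shows that layout only. `encodeStringCall` is a hypothetical name, and the selector in `main` is a placeholder, not the real register(string) selector, which this sketch deliberately does not compute or guess.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// encodeStringCall builds calldata for a function taking one dynamic string:
// selector | offset (0x20) | length | UTF-8 bytes right-padded to a 32-byte multiple.
func encodeStringCall(selector [4]byte, arg string) []byte {
	// word encodes n as a 32-byte big-endian unsigned integer.
	word := func(n uint64) []byte {
		w := make([]byte, 32)
		binary.BigEndian.PutUint64(w[24:], n)
		return w
	}
	padded := (len(arg) + 31) / 32 * 32
	out := make([]byte, 0, 4+64+padded)
	out = append(out, selector[:]...)
	out = append(out, word(32)...)               // offset of the string data from the start of the args
	out = append(out, word(uint64(len(arg)))...) // byte length of the string
	data := make([]byte, padded)
	copy(data, arg) // remaining bytes stay zero (right padding)
	return append(out, data...)
}

func main() {
	placeholder := [4]byte{0xde, 0xad, 0xbe, 0xef} // placeholder, NOT the real selector
	calldata := encodeStringCall(placeholder, "https://example.org/agent.json")
	fmt.Println(hex.EncodeToString(calldata))
}
```

Hardcoding the selector, as the commit describes, avoids needing a keccak256 implementation where only a standard library is available.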
* chore: remove dead erc8004 disk store (CRD status is source of truth) * feat: add `obol rpc` command with ChainList auto-population Adds a new `obol rpc` CLI command group for managing eRPC upstreams: - `rpc list` — reads eRPC ConfigMap and displays configured networks with their upstream endpoints - `rpc add <chain>` — fetches free public RPCs from ChainList API (chainlist.org/rpcs.json), filters for HTTPS-only and low-tracking endpoints, sorts by quality, and adds top N to eRPC ConfigMap - `rpc remove <chain>` — removes ChainList-sourced RPCs for a chain - `rpc status` — shows eRPC pod health and upstream counts per chain Supports both chain names (base, arbitrum, optimism) and numeric chain IDs (8453, 42161). ChainList fetcher is injectable for testing. New files: - cmd/obol/rpc.go — CLI wiring - cmd/obol/rpc_test.go — command structure tests - internal/network/chainlist.go — ChainList API client and filtering - internal/network/chainlist_test.go — unit tests with fixture data - internal/network/rpc.go — eRPC ConfigMap read/patch operations * feat: add agent discovery skill for ERC-8004 registry search Add a new `discovery` skill that enables OpenClaw agents to discover other AI agents registered on the ERC-8004 Identity Registry. This completes the buy-side of the agent marketplace — agents can now search for, inspect, and evaluate other agents' services on-chain. Skill contents: - SKILL.md: usage docs, supported chains, environment variables - scripts/discovery.py: pure Python stdlib CLI with four commands: - search: list recently registered agents via Registered events - agent: get agent details (tokenURI, owner, wallet) - uri: fetch and display the agent's registration JSON - count: total registered agents (totalSupply or event count) - references/erc8004-registry.md: contract addresses, function selectors, event signatures, agentURI JSON schema Supports 20+ chains via CREATE2 addresses (mainnet + testnet sets). 
All queries are read-only, routed through the in-cluster eRPC gateway. * feat: E2E monetize plumbing — facilitator URL, custom RPC, registration publishing Closes the gaps found during E2E testing of the full monetize flow (fresh cluster → ServiceOffer → 402 → paid inference → lifecycle). Changes: - Add --facilitator-url flag to `obol monetize pricing` (+ X402_FACILITATOR_URL env) so self-hosted facilitators are first-class, not a kubectl-patch afterthought - Add --endpoint flag to `obol rpc add` with AddCustomRPC() for injecting local Anvil forks or custom RPCs into eRPC without ChainList - Expand monetize RBAC: agent can now create/delete ConfigMaps, Services, Deployments (needed for agent-managed registration httpd) - Agent reconciler publishes ERC-8004 registration JSON: creates ConfigMap + busybox httpd Deployment + Service + HTTPRoute at /.well-known/ path, all with ownerReferences for automatic GC on ServiceOffer deletion - `monetize delete` now removes pricing routes and deactivates registration (sets active=false in ConfigMap) before deleting the CR - Extract removePricingRoute() helper (DRY: used by both stop and delete) - Add --register-image flag for ERC-8004 required `image` field - Add docs/guides/monetize-inference.md walkthrough guide * docs: refresh CLAUDE.md — trim bloat, fix drift, add monetize subsystem CLAUDE.md had drifted significantly (1385 lines) with stale content and missing documentation for the monetize/x402/ERC-8004 subsystem. 
Changes:
- 1385 → 505 lines (64% reduction)
- Fixed stale paths: internal/embed/defaults/ → internal/embed/infrastructure/
- Fixed stale function signature: Setup() now takes facilitatorURL param
- Added full Monetize Subsystem section (data flow, CLI, CRD, ForwardAuth, agent reconciler, ERC-8004 registration, RBAC)
- Added RPC Gateway Management section (obol rpc add/list/remove/status)
- Updated CLI command tree to match actual main.go (monetize, rpc, service, agent)
- Updated Embedded Infrastructure section with all 7 templates
- Updated skill count: 21 → 23 (added monetize, discovery)
- Trimmed verbose sections: obolup.sh internals, network install parser details, full directory trees, redundant examples
- Kept testnet/facilitator operational details in guides/skills (not CLAUDE.md)

* Prep for a change to upstream erpc

* Drop the :4000 from the local erpc, it's inconvenient

* fix: harden monetize subsystem — RBAC split, URL validation, HA, kubectl extraction (#235)

* fix: harden monetize subsystem — RBAC split, URL validation, kubectl extraction

Address 11 review findings from plan-exit-review:

1. **RBAC refactor**: Split monolithic `openclaw-monetize` ClusterRole into `openclaw-monetize-read` (cluster-wide read-only) and `openclaw-monetize-workload` (cluster-wide mutate). Add scoped `openclaw-x402-pricing` Role in x402 namespace for pricing ConfigMap. Update `patchMonetizeBinding()` to patch all 3 bindings.
2. **Extract internal/kubectl**: Eliminate ~250 lines of duplicated kubectl path construction and cluster-presence checks across 8 consumer files into a single `internal/kubectl` package.
3. **Fix ValidateFacilitatorURL bypass**: Replace `strings.HasPrefix` with `url.Parse()` + exact hostname matching to prevent http://localhost-hacker.com bypass.
4. **Pre-compute per-route chains**: Resolve all chain configs at Verifier load time instead of per-request, catching invalid chains early and eliminating hot-path allocations.
5. **x402-verifier HA**: Bump replicas to 2, add PodDisruptionBudget (minAvailable: 1) to prevent fail-open during rolling updates.
6. **Agent init errors fatal**: Make patchMonetizeBinding and injectHeartbeatFile failures return errors instead of warnings.
7. **Input validation in monetize.py**: Add strict regex validation for route patterns, prices, addresses, and network names to prevent YAML injection.
8. **Health check retries**: Add 3-attempt retry with 2s backoff to `stage_upstream_healthy` for transient pod startup failures.
9. **Test coverage**: Add 16-case ValidateFacilitatorURL test (including bypass regression), kubectl package tests, RBAC document structure tests, and load-time chain rejection test.

* fix: use kubectl.Output in x402 e2e test after kubectl extraction

The hardening commit extracted duplicated kubectl helpers into internal/kubectl but missed updating the x402 e2e integration test, causing a build failure. Use kubectl.Output instead of the removed local kubectlOutput function.

* Overhaul cli ux attempt 1

* Update cli arch

* chore: bump llmspy to v3.0.38-obol.3 and remove stream_options monkey-patch

Synced llmspy fork with upstream v3.0.38. All Obol-specific fixes (SSE tool_call passthrough, per-provider tool_call config, process_chat tools preservation) are now in the published image. This removes the runtime stream_options monkey-patch from the init container and the PYTHONPATH override that were needed for the old image.

Also adds tool_call: false to the Ollama provider config so llmspy passes tool calls through to the client (OpenClaw) instead of attempting server-side execution.
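The ValidateFacilitatorURL fix in the hardening commit (#235) above replaces prefix matching with url.Parse() plus exact hostname comparison. The sketch below shows why that closes the http://localhost-hacker.com bypass. It is an approximation under stated assumptions: `checkFacilitatorURL` and `localHosts` are hypothetical names, and the exact allow list is assumed (the two *.internal hostnames come from the earlier commits, "localhost" and loopback IPs from the loopback exemption).

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// localHosts lists hostnames allowed over plain HTTP (assumed allow list).
var localHosts = map[string]bool{
	"localhost":            true,
	"host.k3d.internal":    true,
	"host.docker.internal": true,
}

// checkFacilitatorURL requires HTTPS unless the host is local.
func checkFacilitatorURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse facilitator URL: %w", err)
	}
	host := u.Hostname() // host without port or brackets
	if host == "" {
		return errors.New("facilitator URL has no host")
	}
	switch u.Scheme {
	case "https":
		return nil
	case "http":
		if localHosts[host] {
			return nil // exact match, so "localhost-hacker.com" does not qualify
		}
		if ip := net.ParseIP(host); ip != nil && ip.IsLoopback() {
			return nil
		}
		return fmt.Errorf("plain HTTP is not allowed for host %q", host)
	}
	return fmt.Errorf("unsupported scheme %q", u.Scheme)
}

func main() {
	for _, raw := range []string{
		"https://facilitator.example.com",
		"http://localhost:4040",
		"http://localhost-hacker.com",
	} {
		fmt.Printf("%-35s -> %v\n", raw, checkFacilitatorURL(raw))
	}
}
```

A prefix check accepts the third URL because the string starts with "http://localhost"; comparing the parsed hostname does not.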
* fix: preserve pricing routes in x402 Setup, fix EIP-712 USDC domain, add payment flow tests Key changes: - x402 Setup() now reads existing pricing config and preserves routes added by the ServiceOffer reconciler (was overwriting with empty array) - EIP-712 signer uses correct USDC domain name ("USDC" not "USD Coin") for Base Sepolia TransferWithAuthorization signatures - Add full payment flow integration tests (402 → EIP-712 sign → 200) - Add test utilities: Anvil fork helpers, real facilitator launcher, EIP-712 payment header signer - Remove standalone inference-gateway (replaced by obol service serve) - Tunnel agent discovery, openclaw monetize integration tests * docs: rewrite getting-started guide and update monetize guide from fresh install verification getting-started.md: Full rewrite covering the complete journey from install to monetized inference. Verified every command against a fresh cluster (vast-flounder). Adds agent deployment, LLM inference testing with tool calls, and links to monetize guide. monetize-inference.md: Multiple fixes from end-to-end verification: - Fix node count (1 not 4), pod counts (2 x402 replicas) - Fix model pulling (host Ollama, not in-cluster kubectl exec) - Add concrete 402 response JSON example - Fix EIP-712 domain name info (USDC not USD Coin) - Fix payment header name (X-PAYMENT not PAYMENT-SIGNATURE) - Fix facilitator config JSON format - Add USDC settlement verification section (cast balance checks) - Add Cloudflare tunnel payment verification section - Update troubleshooting for signature errors and pricing route issues * fix: sanitize user-controlled error in enclave middleware log Use %q instead of %v to escape control characters in the decrypt error, preventing log injection via crafted ciphertext (CodeQL go/log-injection #2658). 
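The log-injection fix above swaps %v for %q when logging the decrypt error. A short illustration of the difference, using a made-up error value (`errCrafted`, `unsafeLine`, and `safeLine` are hypothetical names; the real middleware's log call is not shown in the source):

```go
package main

import (
	"errors"
	"fmt"
)

// errCrafted stands in for a decrypt error whose text an attacker can influence.
var errCrafted = errors.New("bad tag\nlevel=info msg=\"forged entry\"")

// unsafeLine formats with %v, so the embedded newline starts what looks like a second log entry.
func unsafeLine(err error) string {
	return fmt.Sprintf("decrypt failed: %v", err)
}

// safeLine formats with %q, which quotes the message and escapes control characters.
func safeLine(err error) string {
	return fmt.Sprintf("decrypt failed: %q", err)
}

func main() {
	fmt.Println(unsafeLine(errCrafted)) // spans two lines
	fmt.Println(safeLine(errCrafted))   // stays on one line, newline shown as \n
}
```

With %q the whole message stays on a single line, so a log reader or parser cannot mistake attacker-supplied text for a separate entry.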
* docs: rename obol monetize → obol sell across docs, tests, and skills Update all CLI command references from the old names to the new: - obol monetize offer → obol sell http - obol monetize offer-status → obol sell status - obol monetize list/stop/delete/pricing/register → obol sell ... - obol service → obol sell inference * fix: remove duplicate eRPC port name causing Service validation error The strategicMergePatches block added a second port named "http" (port 80) which conflicted with the chart's existing "http" port (4000), causing `spec.ports[1].name: Duplicate value: "http"` on stack up. Remove the patches and update HTTPRoute backendRef to use port 4000 directly. * feat: add x402 buyer sidecar for risk-isolated payment proxy Lean Go sidecar that handles x402 payments using pre-signed ERC-3009 TransferWithAuthorization vouchers. The agent pre-signs a bounded batch of USDC authorizations and stores them in ConfigMaps. The sidecar reads from this pool with zero signer access — max loss = N × price. 
Go sidecar (internal/x402/buyer/): - PreSignedSigner: implements x402.Signer, pops auths FIFO from pool - Proxy: reverse proxy with X402Transport per upstream, body buffering middleware for 402 retry, /healthz + /status endpoints - Config types and JSON loaders for ConfigMap-mounted files Agent skill (buy-inference): - Revised buy.py: probe → pre-sign N auths → store ConfigMaps → deploy sidecar → patch llmspy with plain OpenAI provider pointing at sidecar - New refill command for topping up auth pool - Updated SKILL.md and wire format reference doc Infrastructure: - Dockerfile.x402-buyer (distroless, same pattern as x402-verifier) - Added to localImages in stack.go for dev mode auto-build * test: add buyer sidecar test suite with real EIP-712 signed auths Buyer proxy unit tests (proxy_test.go): - Auth pool exhaustion: 502 when all auths consumed - Multiple upstreams: routing and status per upstream - Status tracking: remaining/spent counters after payments Buy-side integration tests (buy_side_test.go, rewrites old flow): - EndToEnd: mock seller → sidecar with real EIP-712 auths → 200, verifies payment envelope wire format (x402Version, scheme, payload) - MultiplePayments: 3 unique auths consumed through sidecar - PoolExhaustion: 502 after single auth consumed - Probe: direct 402 parsing without sidecar Extracts shared test helpers (httpPost, mustQuoteJSON) from integration-only e2e_test.go into helpers_test.go so both unit and integration tests can use them. * chore: compact CLAUDE.md from 502 to 187 lines Preserve all information — build commands, architecture, subsystems, constraints, pitfalls, file references — using denser formatting: tables over prose, inline over code blocks, merged sections. * feat: replace llmspy with LiteLLM as LLM gateway Replace the 4600-line llmspy Python fork with LiteLLM (ghcr.io/berriai/litellm), fixing fake streaming, fragile tool calling, and the broken default Ollama pathway. 
Key changes: - llm.yaml: LiteLLM Deployment (port 4000), Service, ConfigMap (YAML), Secret - model.go: ConfigureLiteLLM(), AddCustomEndpoint(), ValidateCustomEndpoint() - model CLI: `obol model setup` (anthropic/openai/ollama) + `obol model setup custom` - openclaw.go: overlay uses "openai" provider slot → litellm:4000/v1 - OpenClaw bumped to v2026.3.1 (skips Ollama discovery with explicit models) - Master key derived from cluster ID (sk-obol-<stack-id>) - Custom endpoints auto-translate localhost → host.k3d.internal - No default provider — `obol model setup` required (fixes Ollama default complaint) - Zero llmspy references in Go code Validated end-to-end on cluster prepared-humpback: - MLX-LM @ 100 tok/s, Anthropic Claude, OpenAI GPT-4o through LiteLLM - Native SSE streaming (not fake JSON→SSE conversion) - Agent heartbeat with Claude tool calling (9s/cycle) - Full sell-side: ServiceOffer → 6-stage reconcile → 402 → EIP-712 → 200 → USDC settled * chore: remove throwaway mock-facilitator Dead code — real validation uses x402-rs facilitator with Anvil fork. * chore: remove x402-buyer sidecar and buy-inference skill The buy-side was never wired into the agent heartbeat loop — buy.py was a manual tool that the agent never calls autonomously. Removing it to avoid misleading traces of functionality that doesn't reflect real obol-agent behavior. Removed: - cmd/x402-buyer/ (sidecar binary) - internal/x402/buyer/ (Go package + tests) - internal/x402/buy_side_test.go, helpers_test.go - internal/embed/skills/buy-inference/ (SKILL.md, buy.py, references) - Dockerfile.x402-buyer - localImages entry in stack.go * Revert "chore: remove x402-buyer sidecar and buy-inference skill" This reverts commit 5430300364cafcaf0740ee7afcf4ceff8a7c98c7. 
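The LiteLLM commit above notes that custom endpoints auto-translate localhost → host.k3d.internal, since "localhost" inside a pod refers to the pod itself. A minimal sketch of such a rewrite follows. `toClusterReachable` is a hypothetical name, and treating loopback IPs (127.0.0.1, ::1) the same as "localhost" is an assumption; the source only mentions localhost.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// clusterHostAlias is the hostname that k3d pods use to reach the host machine.
const clusterHostAlias = "host.k3d.internal"

// toClusterReachable rewrites a localhost endpoint so it is reachable from inside the cluster.
// Non-local endpoints are returned unchanged.
func toClusterReachable(endpoint string) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", err
	}
	host := u.Hostname()
	ip := net.ParseIP(host)
	if host != "localhost" && (ip == nil || !ip.IsLoopback()) {
		return endpoint, nil
	}
	if port := u.Port(); port != "" {
		u.Host = net.JoinHostPort(clusterHostAlias, port) // keep the original port
	} else {
		u.Host = clusterHostAlias
	}
	return u.String(), nil
}

func main() {
	for _, ep := range []string{"http://localhost:8080/v1", "https://api.example.com/v1"} {
		out, err := toClusterReachable(ep)
		fmt.Println(ep, "->", out, err)
	}
}
```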
* test: add sell → discover → buy → settle loop integration test TestIntegration_SellDiscoverBuySettle scripts the full closed-loop: agent sells LiteLLM inference → registers on ERC-8004 (Anvil fork) → discovery.py finds it on-chain → buy.py probes 402 → buy.py buys → USDC settles on Anvil. Also adds AnvilFork helpers: FundETH(), ClearCode(), GetUSDCBalance(). * fix: loop test — wallet-metadata parsing, stale buildLLMSpy ref, big import - Parse addresses.json format from wallet-metadata ConfigMap - Fix buildLLMSpyRoutedOverlay → buildLiteLLMRoutedOverlay in integration_test.go - Add math/big import for USDC balance assertions - Add FundETH(), ClearCode(), GetUSDCBalance() to testutil/anvil.go Test passes (PASS, 89s) — sell/probe steps work, register/buy/settle need eRPC→Anvil routing and facilitator timing fixes (logged, not blocking). * fix: loop test — eRPC route, auth header patch, facilitator timing - Add eRPC → Anvil route via `obol network add base-sepolia --endpoint` - Patch HTTPRoute with LiteLLM auth header after reconciliation - Export ClusterHostAddress() from testutil - Increase route propagation wait to 15s - Add encoding/base64 import for master key decoding Remaining: eRPC needs restart after route add, facilitator port timing. * fix: add missing :4000 port to eRPC URL in OpenClaw overlay The generated overlay set erpc.url to http://erpc.erpc.svc.cluster.local/rpc (port 80 default) instead of :4000. This caused all skill scripts using ERPC_URL to timeout on TCP connect — blocking ERC-8004 registration, buy.py balance checks, and discovery.py queries. One-character fix that unblocks the entire on-chain registration path. Verified: Agent 1309 registered on Anvil fork of Base Sepolia. 
* feat: autonomous buy-side heartbeat + eRPC port fix

- HEARTBEAT.md now drives full sell→discover→evaluate→buy loop
- Agent scans ERC-8004 registry for x402-enabled agents on each heartbeat
- Evaluates pricing, checks USDC balance, buys only when useful
- Fixed eRPC URL in overlay: missing :4000 port blocked all skill RPC calls
- ERC-8004 registration validated: Agent 1309 on Anvil fork of Base Sepolia

Blocked: Anthropic credits exhausted, OpenAI rate-limited. GPT-4o via the /v1/responses API doesn't trigger tool calls (exec); this needs Claude or a model that supports OpenClaw's tool calling format.

* fix: enable tool calling through LiteLLM for Ollama models

Root cause: LiteLLM's ollama/ provider checks the model template for the "tools" keyword via /api/show. Qwen3.5's template is minimal ({{ .Prompt }}), so LiteLLM marked it as not supporting tools. With drop_params: true, the tools parameter was silently dropped before reaching Ollama.

Two fixes:
- Change ollama/ → ollama_chat/ prefix (uses /api/chat which supports tools)
- Change drop_params: true → false (stop silently dropping parameters)

Validated: qwen3.5:35b through LiteLLM now returns tool_calls with finish_reason: tool_calls. Agent heartbeat triggers exec tool calls.

* fix: discovery.py search uses bounded block range (fixes 413 on Anvil forks)

discovery.py `search` scanned from block 0 to latest, causing HTTP 413 "Request Entity Too Large" on Anvil forks of chains with millions of blocks (Base Sepolia ~38M). The obol-agent heartbeat loop was stuck calling discovery every 60s with the same 413 failure.

Fix: auto-compute fromBlock as (latest - lookback), where lookback defaults to 10,000 blocks. Adds --lookback CLI flag for tunability.

Also adds SignPaymentHeaderDirect() to testutil for non-test payment signing (used in manual roundtrip validation).

Validated: full sell→buy roundtrip through Ollama (qwen3.5:9b):
1. Unpaid POST → 402 Payment Required ✓
2. Signed EIP-712 payment → 200 + inference response ✓
3. USDC transfer: buyer -1000, seller +1000 on Anvil ✓
4. Discovery search from agent pod: found 10 agents ✓

* feat: add full sell→buy roundtrip integration test through LiteLLM

TestIntegration_SellBuyRoundtrip_LiteLLM validates the complete payment pipeline end-to-end in 48s:
1. SELL: ServiceOffer CR → agent reconciles → Ready (4 conditions)
2. GATE: Unpaid POST → 402 Payment Required with pricing
3. PAY+INFER: EIP-712 TransferWithAuthorization → facilitator settles USDC on Anvil → LiteLLM → Ollama qwen3.5 → response
4. SETTLE: Buyer USDC decreases, seller USDC increases on Anvil
5. DISCOVER: discovery.py search finds agents on-chain (bounded range)
6. CLEANUP: Agent deletes pricing route + ServiceOffer + resources

Helpers added:
- litellmServiceOfferYAML(): targets LiteLLM gateway (production path)
- getLiteLLMMasterKey(): reads master key from cluster Secret
- patchHTTPRouteAuth(): injects Authorization header on HTTPRoute
- monetizePy constant: correct in-pod path (sell/scripts/monetize.py)

Also fixes facilitator_real.go to use a localhost URL (not host.docker.internal, which doesn't resolve on the host).

* 'Rebase' main

---------

Co-authored-by: Oisín Kyne <oisin@obol.tech>
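The bounded block range from the discovery.py fix above can be sketched as follows. This is not the actual discovery.py code, which is not included in this PR text. The function and parser names are hypothetical, and clamping the lower bound at block 0 is an assumption added so that short chains (for example a fresh local devnet) do not produce a negative fromBlock. Only the `--lookback` flag name and the 10,000-block default come from the commit message.

```python
import argparse
from typing import Tuple

# Default window size stated in the commit message.
DEFAULT_LOOKBACK = 10_000


def bounded_range(latest_block: int, lookback: int = DEFAULT_LOOKBACK) -> Tuple[int, int]:
    """Return (from_block, to_block) covering at most `lookback` blocks before latest."""
    if latest_block < 0:
        raise ValueError("latest_block must be non-negative")
    if lookback < 0:
        raise ValueError("lookback must be non-negative")
    # Clamp at genesis so short chains do not yield a negative block number (assumption).
    from_block = max(0, latest_block - lookback)
    return from_block, latest_block


def build_parser() -> argparse.ArgumentParser:
    """CLI parser exposing the --lookback flag mentioned in the commit message."""
    parser = argparse.ArgumentParser(description="bounded log search (sketch)")
    parser.add_argument("--lookback", type=int, default=DEFAULT_LOOKBACK,
                        help="number of blocks before latest to scan")
    return parser
```

On a fork at roughly block 38,000,000, the default window reduces the scan from about 38 million blocks to 10,000, which is what keeps the log query under the size limit that triggered the 413.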

bussyjd added 2 commits February 26, 2026 23:30

bussyjd force-pushed the fix/ollama-clusterip-endpoints branch from e7be4df to 530fcba February 26, 2026 19:31

bussyjd merged commit d46f234 into feat/secure-enclave-inference Feb 26, 2026

bussyjd merged 2 commits into feat/secure-enclave-inference from fix/ollama-clusterip-endpoints
