integration: validate tunnel onboarding with live OBOL faucet flow (#452)
- prefer current Go path for nohup/cron flow execution
- poll for remote-signer pod creation before age checks
- allow cooldown-safe flow-15 reruns when Bob already has faucet OBOL
- poll post-claim balances to tolerate public RPC state lag
…in' into integration/pr450-pr451-cloudflare-obol
Exact-head validation is now aligned with the pushed commit. Local validation summary: the PR body now includes the exact-head smoke matrix and the live/fork OBOL receipt hashes from the exact-head run. One remaining cleanup follow-up I observed during that run: the smoke harness still left the obol-flow10-x402-facilitator helper container behind.
Final status: exact-head validation is green.
Remaining noted follow-up: the smoke harness still left the obol-flow10-x402-facilitator helper container behind.
Co-authored-by: bussyjd <bussyjd@users.noreply.github.com>
* feat(buy): add `obol buy inference` host CLI

  Mirrors `obol sell inference` on the buyer side. The host CLI handles default-seller resolution, ERC-8004 identity pre-flight, and USDC->micro-units conversion, then dispatches to the existing `buy.py buy` skill in the obol-agent pod. Single canonical wallet, no host-side keystore.

  - internal/x402/setup.go: DefaultBuySellerURL, DefaultBuySellerAgentID, DefaultBuySellerChain placeholders (TODO: wire live values once the default seller is provisioned).
  - internal/agentruntime/exec.go: ExecInPod + BuildExecArgs generalize the kubectl-exec helper that was hardcoded to the hermes binary.
  - internal/hermes/hermes.go: cliViaKubectlExec + hermesExecArgs delegate to the new agentruntime helpers; the existing test stays valid.
  - internal/buy/discover.go: .well-known/agent-registration.json fetcher and ERC-8004 agentId verification (hard-fail on mismatch).
  - cmd/obol/buy.go: `obol buy inference [<name>] --seller --model --budget --expected-agent-id --no-verify-identity --auto-refill ...`.

* test(flow-11): validate host buy inference on integration
Update: host-buy validation on the integration branch used the exact pre-merge integrated head.

Targeted checks:
Live flow evidence: ran
Key assertions that passed:
Exact-head receipts:
Flow result:
Artifacts from that exact-head run:
* Agent crd
* Next phase
* 1, 2a, 2b, 2c, 4a, 4b, 5, 6, 7, 8, 9
* 2d
* Update with almost all complete, time for testing
* Bug fixing
* chore: remove stray runtime log
* chore(flows): renumber sell-agent smoke flow for integration
* fix(agent): harden CRD update sync semantics

---------

Co-authored-by: bussyjd <bussyjd@users.noreply.github.com>
Co-authored-by: bussyjd <jd@obol.tech>
Both versions were intended to land via the integration branch behind PR #452 but did not make it through the squash merges. Aligning main with the latest published tags.

- frontend: v0.1.21-rc1 → v0.1.23 (real release, off the rc)
- hermes-agent: v2026.4.30 → v2026.5.7
- justfile dev-frontend-reset target: v0.1.19 → v0.1.23
Pulls forward five small correctness fixes that were carried on the integration branch behind #452 but did not survive the squash merges.

- Re-queue offers when their referenced Agent changes. Without this, an Agent status edit (e.g. status.pinnedModel after the user edits spec.model) never propagates into the offer's status.agentResolution, because the offer reconciler only runs when the offer itself changes.
- Refuse to Update Namespace and PersistentVolumeClaim during applyAgentObject. PVCs reject wholesale Update with "spec is immutable after creation", and the controller's RBAC only grants `create` on Namespaces. Treat existence as success for these kinds and move on; mutable kinds (ConfigMap, Secret, Deployment, Service, ServiceAccount) keep going through the normal Update path.
- Fall back to status.agentResolution.Model in the storefront catalog when an offer's spec.model is empty (the canonical state for type=agent offers, where the model lives on the linked Agent).
- Bump the serviceoffer-controller Deployment memory request from 64Mi to 128Mi and the limit from 256Mi to 512Mi. The Agent informer + agent reconciler + in-controller keystore generation pushed steady-state past 256Mi after #453 and triggered OOMKilled restart loops.
- Set GATEWAY_ALLOW_ALL_USERS=true on CRD-rendered agent pods. CRD agents only expose the API (gated by API_SERVER_KEY + ForwardAuth); no Telegram/Discord/dashboard platforms are wired. The flag silences Hermes' user-gateway startup warning without opening any real surface.
…odel pin

Pulls forward three dev-experience improvements from the integration branch behind #452 that did not survive the squash merges.

- Selective image rebuild via OBOL_FORCE_REBUILD_LOCAL_DEV_IMAGES. The variable now accepts a comma-separated list of image short names (e.g. `x402-verifier,serviceoffer-controller`) in addition to the existing `true`/`all` and `false`/`0`/unset behaviours. The full image set is x402-verifier, serviceoffer-controller, x402-buyer, demo-server, and obol-stack-public-storefront (with `public-storefront` accepted as an alias). Saves a full ~10-minute rebuild when only one image changed.
- Claude Code plugin install tip on stack up. After `obol stack up`, if the `claude` CLI is present but the ObolNetwork/skills marketplace or its plugin isn't installed, surface a one-line install hint. Reads ~/.claude/plugins/{known_marketplaces,installed_plugins}.json best-effort; silently no-ops on any error, so a malformed Claude config can never block stack up.
- Auto-pin a model on the agent-backed demo. `obol sell agent --demo` resolves the first non-`paid/*` model from the cluster's LiteLLM config (the same source `obol model list` reads) and writes it into the rendered Agent's spec.model so the controller doesn't park at ModelUnpinned. Returns a clear "configure a model first" error if the cluster has nothing usable, and removes a stale "depend on step 2d" caveat that no longer applies.

Docs updated in CLAUDE.md, .agents/skills/obol-stack-dev/SKILL.md, and .agents/skills/obol-stack-dev/references/dev-environment.md.
…image
Verified locally against ghcr.io/obolnetwork/remote-signer:v0.3.0:
- Main's KEYSTORE_PASSWORD env name is unrecognised; the binary exits
with Error: NoPassword on startup.
- Main's keystore dir /keystores conflicts with the image's default
/data/keystores (declared as a volume in the image config).
- Main's /health readiness probe returns HTTP 404; the binary only
serves /healthz, which returns {"status":"ok"}.
Together these mean any Agent CR with wallet.create=true on main has a
remote-signer that crash-loops or fails liveness, blocking the agent
from ever reaching Ready.
This is what the integration branch behind #452 was carrying. Pulling
it forward:
- Move keystore dir to /data/keystores (the image default), and pin
the on-disk filename to keystore.json so the Secret volume
projection no longer needs to thread the V3 UUID through; the V3
document carries the address internally so the cosmetic filename
doesn't matter.
- Add ensureCanonicalKeystoreKey migration helper: on reconcile of an
existing Secret with the wallet annotation, if data is keyed under
the old UUID-named JSON field, rewrite it as keystore.json
in-place. Refuses ambiguous Secrets with multiple legacy JSON keys.
- Switch env scheme to upstream's SIGNER__SECTION__KEY hierarchy
(SIGNER__SERVER__HOST, SIGNER__SERVER__PORT, SIGNER__KEYSTORE__DIR,
SIGNER__KEYSTORE__PASSWORD, SIGNER__LOGGING__FORMAT/LEVEL). Matches
the master agent's working config in hermes-obol-agent.
- Switch readiness and liveness probes from /health to /healthz.
Adds 8 unit tests covering fresh keystore creation, reuse, legacy key
migration, ambiguity rejection, malformed data, and the canonical
Secret/Deployment shape (single keystore.json projected, password
read via env, never mounted).
`resolveAssetTermsFor` returned `--token X is not available on chain Y (supported tokens: OBOL, USDC)` when a token wasn't registered for the requested chain. The "supported tokens" list came from the global registry (`SupportedTokens()`), not from the chain, so operators reading the error saw OBOL listed as supported even though the lookup just failed on `base-sepolia`/`base`/etc. This was actively misleading.

Surfaced today on spark2 while wiring `obol sell inference … --token OBOL --chain base-sepolia`: the binary (v0.9.0) rejected OBOL on base-sepolia (registry entry added in #452 after the release was cut), but the message claimed OBOL was supported.

Changes:
- Add `TokensOnChain(chain)` and `ChainsForToken(token)` helpers in internal/x402/tokens.go so callers can ask the registry chain-scoped questions without iterating it themselves.
- Rewrite the error in `resolveAssetTermsFor` to use both: `--token OBOL is not available on chain base-sepolia; tokens on base-sepolia: OBOL, USDC; OBOL is registered on: base-sepolia, ethereum`, with four branches covering the chain-empty, token-empty, both-empty, and normal cases.
- Add table-driven tests covering the helpers (chains/tokens lookups, aliases, unknown chain/token, case-insensitive token names).

Co-authored-by: bussyjd <bussyjd@users.noreply.github.com>
Summary
What changed:
- …(stack down / stack purge) and preserves cached stack IDs so k3d fallback cleanup still works when purge removes config
- …when ruby is unavailable
- …gemma4-fast path green across the integrated smoke, live Base Sepolia OBOL, and forked-OBOL flows

Why it matters:
13ba63c17b701fafe42606501125e309768da9bb is now green for the full smoke suite, including flow-11, flow-14, and flow-13.

Risk level: medium
Commit under test: 13ba63c17b701fafe42606501125e309768da9bb
Base branch: main

Scope
Validation
CI checks:
lint-test, Analyze (actions), Analyze (go), Analyze (javascript-typescript), Analyze (python)

Pre-commit / local correctness checks:
Exact-head release smoke:
Inline smoke report summary:
flow-01-prerequisites, flow-02-stack-init-up, flow-03-inference, flow-04-agent, flow-05-network, flow-06-sell-setup, flow-07-sell-verify, flow-10-anvil-facilitator, flow-08-buy, flow-09-lifecycle, flow-11-dual-stack, flow-14-live-obol-base-sepolia, flow-13-dual-stack-obol

Artifacts from the exact-head run:
- .tmp/release-smoke-20260509-165243/RELEASE_REPORT.md
- .tmp/release-smoke-20260509-165243/flow-11-receipts/receipt-summary.json
- .tmp/release-smoke-20260509-165243/flow-14-receipts/receipt-summary.json
- .tmp/release-smoke-20260509-165243/flow-13-receipts/receipt-summary.json

Live Chain Evidence
Network: Base Sepolia (84532)
RPC/provider: https://base-sepolia-rpc.publicnode.com
Facilitator: https://x402.gcp.obol.tech (local: http://127.0.0.1:53788)

Contracts and tokens:
- 0x8004A818BFB912233c491871b3d84c89A494BD9e
- 0x0a09371a8b011d5110656ceBCc70603e53FD2c78 (Obol Network / OBOL / 18 decimals)
- 0x210BBd033630e5e611B7922D70b0Caabe64636d9 (flow-13)
- 0x000000000022D473030F116dDEE9F6B43aC78BA3 (Permit2)

Wallet roles:
- 0xC0De030F6C37f490594F93fB99e2756703c4297E
- 0x57b0eF875DeB5A37301F1640E469a2129Da9490E

Exact-head transaction evidence:
- flow-11-dual-stack (exact-head evidence): port 5702, tunnel https://pottery-arms-horses-tall.trycloudflare.com; txs 0x844bb9d8179571aca3f53fd95b5ba33cd4c972538c84a54138cbfdf0ee37604c, 0xc2a9a72bed2d7cd8311839b4c803e4d950a89704997af017bd76cdbdc774f48d, 0x651c44cab864ffed001a3fb089a1198fff7b4e04c1093fe7f4ee86fcf5a6ad71
- flow-14-live-obol-base-sepolia (exact-head evidence): port 5703, tunnel https://statute-allen-leaf-runs.trycloudflare.com; token 0x0a09371a8b011d5110656ceBCc70603e53FD2c78; txs 0xd183bb1ecd2993b87afe72e47e266b5b98f34091dc30d73c061a3d6e30917ee1, 0xa919f4b20b9fcfc0b00efd3b3d0c406bbf44ce7066db0489145f5ecf83d43b4f, 0xa192904a6c415b30cf908de500ff8c8330724b14601cbb9112181a2146deb576; balances 7000000000000000 -> 8000000000000000 wei (+1000000000000000), 4993000000000000000 -> 4992000000000000000 wei (-1000000000000000)
- flow-13-dual-stack-obol (exact-head evidence): tunnel https://catering-solid-night-several.trycloudflare.com; txs 0xef2d85e801191599dec7ed3790bc74dd7b1c1f9c7f4f63c80b36e01254334582, 0x4e476edc29b0576aff44c48ca39889a9bffa38fc626d7087f00e8ff9637cf8b7; balances 10000000000000000000 -> 10001000000000000000 wei (+1000000000000000), 10000000000000000000 -> 9999000000000000000 wei (-1000000000000000)

Runtime Evidence
QA environment:
Python 3.11.14, Go 1.25.5, GitHub CLI 2.86.0, Docker 29.4.2, kubectl client v1.35.3; model gemma4-fast; spark1 -> 192.168.100.11:8000 so the spark2 endpoint stayed available across Cloudflare SSH flap events

Model and paid-route evidence:
- paid/gemma4-fast returned HTTP 200 with coherent content in flow-11, flow-14, and flow-13
- 1000000000000000 wei (0.001 OBOL) from Bob signer to Alice seller
- 1000000000000000 wei (0.001 OBOL) from Bob signer to Alice seller

Post-run cleanup state:
flow-13: stopped by the flow; the flow-10 facilitator helper container had to be removed manually (docker rm -f obol-flow10-x402-facilitator) and is called out below as a remaining cleanup follow-up

Review Notes
Known gaps:
- flow-10 / smoke cleanup still left a helper container (obol-flow10-x402-facilitator) after the exact-head run; I removed it manually after validation. This PR materially improves stale-workspace / cluster cleanup, but there is still one remaining facilitator-container cleanup follow-up.
- shellcheck still reports several pre-existing warnings in the flow harness outside the paths touched here.

Reviewer focus:
- flows/lib.sh: cached stack-id cleanup path
- flows/flow-07-sell-verify.sh: fail-closed public tunnel eRPC verification
- flows/flow-13-dual-stack-obol.sh: explicit runtime prereq handling for the YAML patch helper
- .tmp/release-smoke-20260509-165243/: exact-head run artifacts