post-490: fold #487 + #489, strip debug logs, drop SKIP_PULL (smoke 13/13) by bussyjd · Pull Request #492 · ObolNetwork/obol-stack

bussyjd · 2026-05-13T17:41:07Z

Summary

Post-#490 cleanup integration. Folds in two open feature PRs (#487, #489), strips the four debug log statements left over from the release-smoke 2026-05-13 root-cause hunt, drops the now-obsolete X402_FACILITATOR_SKIP_PULL workaround (upstream ObolNetwork/x402-rs#3 fix landed and the prom-overlay 1.4.9 arm64 manifest was republished), aligns flow-13's buy assertion + prompt with flow-14's pattern, and adds a flows/buy-external.sh harness for testing arbitrary external x402 sellers.

Risk level: low–medium. Each merged PR has its own green CI; the integration branch passed a complete release-smoke (13/13) on spark1 with the freshly-republished registry image and no SKIP_PULL workaround.

Commit under test: 7f29a02

Base branch: main

Scope

Code (internal/x402/forwardauth.go, internal/x402/buyer/signer.go, plus everything from feat(stack): paid services survive stack down/up (sell-inference + sell-http resume + storefront fixes) #487 + Persist ERC-8004 agent identity #489)
Charts / manifests (internal/embed/infrastructure/base/templates/agentidentity-crd.yaml from Persist ERC-8004 agent identity #489)
Flows / QA scripts (flows/flow-13-dual-stack-obol.sh, flows/flow-11-dual-stack.sh, flows/lib.sh, new flows/buy-external.sh)
Docs / skills (skill SKILL.md + release-smoke-debugging.md updates; retrospective marked resolved where applicable)
Images / dependencies — no obol-stack image changes; downstream consumer of upstream ObolNetwork/x402-rs 1.4.9 republish
Other

Validation

CI checks: filled after PR opens.

Unit tests:

go test ./cmd/obol/... ./internal/x402/... ./internal/serviceoffercontroller/... \
        ./internal/inference/... ./internal/schemas/... ./internal/erc8004/... \
        ./internal/embed/... ./internal/monetizeapi/... ./internal/defaults/... \
        -count=1
ok    github.com/ObolNetwork/obol-stack/cmd/obol
ok    github.com/ObolNetwork/obol-stack/internal/x402
ok    github.com/ObolNetwork/obol-stack/internal/x402/buyer
ok    github.com/ObolNetwork/obol-stack/internal/serviceoffercontroller
ok    github.com/ObolNetwork/obol-stack/internal/inference
ok    github.com/ObolNetwork/obol-stack/internal/schemas
ok    github.com/ObolNetwork/obol-stack/internal/erc8004
ok    github.com/ObolNetwork/obol-stack/internal/embed
ok    github.com/ObolNetwork/obol-stack/internal/monetizeapi
ok    github.com/ObolNetwork/obol-stack/internal/defaults

Flow tests (release-smoke gate):

Flow	Network	QA machine label	Worktree	Result	Artifacts
flow-01 prerequisites	n/a	spark1 (arm64)	`obol-stack-qa-20260513-135712-post490`	PASS	`release-smoke-20260513-post490-3-no-skip-pull/flow-01-prerequisites.log`
flow-02 stack init/up	k3d local	spark1	same	PASS	`flow-02-stack-init-up.log`
flow-03 inference	k3d local	spark1	same	PASS	`flow-03-inference.log`
flow-04 agent	k3d local	spark1	same	PASS	`flow-04-agent.log`
flow-05 network	k3d + base-sepolia (paid drpc)	spark1	same	PASS	`flow-05-network.log`
flow-06 sell-setup	k3d local	spark1	same	PASS	`flow-06-sell-setup.log`
flow-07 sell-verify	k3d local	spark1	same	PASS	`flow-07-sell-verify.log`
flow-10 anvil-facilitator	anvil-fork base-sepolia + registry-pulled facilitator	spark1	same	PASS	`flow-10-anvil-facilitator.log`
flow-08 buy (USDC anvil-fork)	anvil-fork + local x402-rs (registry-pulled)	spark1	same	PASS	`flow-08-buy.log`
flow-09 lifecycle	k3d local	spark1	same	PASS	`flow-09-lifecycle.log`
flow-11 dual-stack USDC	live Base Sepolia + `x402.gcp.obol.tech`	spark1	same	PASS	`flow-11-dual-stack.log` + `flow-11-receipts/`
flow-14 live OBOL Base Sepolia	live Base Sepolia + `x402.gcp.obol.tech`	spark1	same	PASS	`flow-14-live-obol-base-sepolia.log` + `flow-14-receipts/`
flow-13 dual-stack OBOL anvil-fork	anvil-fork + local x402-rs (registry-pulled)	spark1	same	PASS	`flow-13-dual-stack-obol.log` + `flow-13-receipts/`

Release smoke:

Three runs on the integration branch:

Run	Commit	Result	Notes
`20260513-post490-1`	`8fb42c9`	12/13 PASS, flow-13 FAIL	Stale `remaining=5` exact-match assertion; agent hallucinated `--count 1`.
`20260513-post490-2-after-flow13-fix`	`d1f97e3`	12/13 PASS, flow-08 FAIL	Transient cloudflared "Network is unreachable" from the seller's agent pod to the trycloudflare tunnel hostname.
`20260513-post490-3-no-skip-pull`	`d1f97e3`	13/13 PASS, RC=0	Re-run with `X402_FACILITATOR_SKIP_PULL` unset; flow-10 pulled the freshly-republished arm64 image from ghcr.io and flow-08 completed the full anvil-fork buy roundtrip. Both flow-08 and flow-13 green simultaneously.

ssh spark1
cd /home/claude/obol-stack-qa-20260513-135712-post490
git rev-parse HEAD                                # d1f97e3 (run 3 carried the same fix tip, no further edits)
RELEASE_SMOKE_INCLUDE_OBOL=true \
RELEASE_SMOKE_INCLUDE_OBOL_FORK=true \
OBOL_DEVELOPMENT=true OBOL_NONINTERACTIVE=true \
OBOL_LLM_ENDPOINT=http://127.0.0.1:8000/v1 \
OBOL_LLM_MODEL=qwen36-fast \
RELEASE_SMOKE_RUN_ID=20260513-post490-3-no-skip-pull \
bash flows/release-smoke.sh
# Release smoke passed
# __SMOKE_DONE_RC__=0
# 13/13 PASS, 0 FAIL

Live Chain Evidence

Network: Base Sepolia (eip155:84532) + Anvil fork of the same.

RPC: paid drpc.live load-balancer for Base Sepolia (lb.drpc.live/base-sepolia/[REDACTED], supplied via BASE_SEPOLIA_RPC); in-cluster eRPC routes mainnet, base, base-sepolia, hoodi.

Facilitator: https://x402.gcp.obol.tech (prom-overlay 1.4.9 admin-merged in obol-infrastructure#2612); host.k3d.internal:8404 for the local x402-rs in flow-10/flow-13 (now pulled fresh from ghcr.io, arm64 manifest sha256:b209345c…).

Contracts and tokens:

Name	Address	Version / notes
USDC (Base Sepolia)	0x036CbD53842c5426634e7929541eC2318f3dCF7e	EIP-3009
OBOL (Base Sepolia)	0x0a09371a8b011d5110656ceBCc70603e53FD2c78	Permit2; flow-14 / flow-13
Permit2	0x000000000022D473030F116dDEE9F6B43aC78BA3	canonical
ERC-8004 IdentityRegistry (Base Sepolia)	0x8004A818BFB912233c491871b3d84c89A494BD9e	flow-11/flow-14 register

Wallet roles: Alice = cast wallet address --private-key REMOTE_SIGNER_PRIVATE_KEY; Bob = deterministic 2nd-derived key (0x57b0eF875DeB5A37301F1640E469a2129Da9490E). Pre-seeded into Bob's remote-signer before stack-up; receiver = facilitator key (Obol-operated).

Balances / deltas: per-flow *-receipts/ directories under release-smoke-20260513-post490-3-no-skip-pull-artifacts/.

Transaction receipts:

Purpose	Artifact	Notes
ERC-8004 registration	`flow-11-receipts/registration-receipt.json`	seller → IdentityRegistry, RegistrationRequested
Metadata / service offer	`flow-11-receipts/metadata-receipt.json`	controller-published `/.well-known/agent-registration.json`
Settlement (USDC)	`flow-11-dual-stack.log` (settle tx)	x402-buyer → Alice
Settlement (OBOL live)	`flow-14-live-obol-base-sepolia.log`	x402-buyer → Alice
Settlement (OBOL fork)	`flow-13-receipts/receipt-summary.json`	local x402-rs settlement on Anvil fork

Runtime Evidence

QA environment:

Item	Value
OS / arch	Linux aarch64 (spark1)
Backend	k3d 5.8.3 on Docker, k3s `v1.35.1-k3s1`
Tool versions	go 1.25.3, kubectl 1.35.3, helm 3.20.1, helmfile 1.4.3, foundry anvil `1.5.1-stable`
QA agent/model	Hermes default runtime, model `qwen36-fast` via OpenAI-compatible vLLM on `127.0.0.1:8000`

Images:

Component	Image	Tag / digest	Source
x402-verifier	`ghcr.io/obolnetwork/x402-verifier`	locally rebuilt under `OBOL_DEVELOPMENT=true`	this branch
serviceoffer-controller	`ghcr.io/obolnetwork/serviceoffer-controller`	locally rebuilt	this branch
x402-buyer	`ghcr.io/obolnetwork/x402-buyer`	locally rebuilt	this branch
LiteLLM	`ghcr.io/obolnetwork/litellm-fork`	`sha-c16b156`	embedded
Public facilitator	`ghcr.io/obolnetwork/x402-facilitator-prometheus-overlay`	`1.4.9` (overlay)	`obol-infrastructure#2612`
Local facilitator (flow-10)	`ghcr.io/obolnetwork/x402-facilitator-prometheus-overlay`	`1.4.9` arm64 digest `sha256:b209345c…` (republished)	`ObolNetwork/x402-rs#3` (this branch validates)

Demo readiness:

Item	Status	Notes
Seller visible / registered	green	flow-11 + flow-14 ERC-8004 registration on Base Sepolia
Buyer discovery works	green	flow-11 + flow-14 paid-route discovery + flow-08 `obol buy inference` against in-cluster Alice
Paid route works	green	All paid steps (flow-08, flow-11, flow-13, flow-14) returned HTTP 200
Settlement visible on-chain	green	per-flow `*-receipts/` directories archive the settlement tx hashes + balance deltas

Review Notes

Superseded PRs (auto-close when this merges):

feat(stack): paid services survive stack down/up (sell-inference + sell-http resume + storefront fixes) #487 — feat(stack): paid services survive stack down/up (sell-inference + sell-http resume + storefront fixes) — folded in as 7125662 (one trivial conflict in cmd/obol/sell_test.go resolved, keeping the explanatory comment block).
Persist ERC-8004 agent identity #489 — Persist ERC-8004 agent identity — folded in as 8fb42c9 (three conflicts in cmd/obol/sell_test.go, internal/serviceoffercontroller/render.go, internal/serviceoffercontroller/render_test.go, all of the same shape: kept HEAD where pr-489 had no content at the same line).

Left out intentionally:

fix(stack): harden stale-resource handling and model selection defaults #444 — fix(stack): harden stale-resource handling and model selection defaults. Draft, external contributor, no reviews. Auto-deletes legacy Kubernetes resources in the default namespace based on heuristic name matches — needs explicit security/operator review per the "narrow review boundaries" guidance in CLAUDE.md. The model-default change overlaps the LLM-routing rank-ordering and isn't validated by the smoke. Self-contained code-wise, but not validated.
[DIRTY] feat(skills): scaffold cluster-bootstrap for LAN/cloud k3s #423 — [DIRTY] feat(skills): scaffold cluster-bootstrap for LAN/cloud k3s. Author-tagged [DIRTY] with the consuming chart changes "intentionally deferred." The author explicitly says "Don't run against a production cluster yet." Scaffold-only.

Known gaps:

The obol buy inference call from a single-stack (flow-08) agent pod to the trycloudflare tunnel hostname intermittently returns Connection error: [Errno 101] Network is unreachable. flow-11 and flow-14 mitigate this with explicit CoreDNS NodeHosts mapping after obol stack up; flow-08 doesn't, which is why a single rerun was needed to confirm the path. Consider adding the same DNS override to flow-08 in a follow-up.
flows/buy-external.sh ships in this branch but the live test against https://inference.v1337.org/services/aeon is a separate follow-up (Bob needs ≥ 0.023 OBOL on Base Sepolia, which the harness's Phase 7 preflight will check before running).

Reviewer focus:

flows/lib.sh — confirm the SKIP_PULL knob removal doesn't leave any dangling references; everything else around x402_facilitator_image() is unchanged.
flows/flow-13-dual-stack-obol.sh — diff vs. flows/flow-14-live-obol-base-sepolia.sh:997-1009 should be near-identical for the auth-count assertion + the prompt wording.
flows/buy-external.sh (new, 604 lines) — single-stack Bob-only harness; bypasses ERC-8004 discovery; reuses flows/lib.sh helpers; piped through scrub_secrets.
internal/x402/forwardauth.go + internal/x402/buyer/signer.go — purely log-statement removal; behavior unchanged (CodeQL alerts for log-injection cleared by the corresponding strip).
internal/serviceoffercontroller/render.go + tests — #487's storefront-filter + registration-document changes co-resident with #489's identity_controller pair; smoke confirms they don't collide.

Closes #487 #489.

`obol sell inference` was producing ServiceOffers with three latent defects that together stopped /.well-known/agent-registration.json from ever being routed for inference-typed sells: 1. The inference subcommand never exposed `--register-*` flags (compare with `obol sell http` which does), even though the help text on `--register` says "registration is enabled by default". 2. `buildInferenceServiceOfferSpec` never wrote `spec.registration` onto the offer at all, so the controller's `reconcileRegistrationStatus` saw `Enabled=false` (zero value) and emitted `Registered=True Disabled` with no RegistrationRequest CR and no /.well-known HTTPRoute. 3. `buildInferenceServiceOfferSpec` hardcoded `spec.model.name = "ollama"` regardless of the actual `--model` value, so anything downstream that keyed off the model id (the controller's description default included) was looking at the wrong string. Surfaced today on spark2 while trying to fetch `https://inference.v1337.org/.well-known/agent-registration.json` — got the Traefik fall-through 404. Manual `kubectl patch` enabled registration, exposed a fourth defect: 4. `buildActiveRegistrationDocument` in the serviceoffer-controller unconditionally overwrote `Spec.Registration.Description` for inference offers with `"<model.name> inference via x402 micropayments"`, even when the operator had supplied an explicit description at sell time. This PR fixes all four: - Add `--no-register`, `--register-name`, `--register-description`, `--register-image`, `--register-skills`, `--register-domains`, `--register-metadata` to `obol sell inference`. - Rename `sellHTTPRegistrationInput` / `buildSellHTTPRegistrationConfig` to the unqualified `sellRegistrationInput` / `buildSellRegistrationConfig` since they now serve both inference and http call sites. - Extend `buildInferenceServiceOfferSpec` to accept the resolved model name and the registration block, write `spec.model.name = <real model id>`, and merge the registration block into `spec.registration` when non-empty. - In the controller's `buildActiveRegistrationDocument`, only fall back to the model-aware description string when the operator left `Spec.Registration.Description` empty. Tests that would have caught the regression earlier: - `TestSellInference_Flags` now requires the six registration flags; their absence on `main` was the bug. - `TestBuildInferenceServiceOfferSpec_RegistrationEnabledByDefault` pins that defaults produce `spec.registration.enabled = true` and the offer name as `spec.registration.name`. - `TestBuildInferenceServiceOfferSpec_NoRegisterOmitsRegistration` pins the `--no-register` opt-out. - `TestBuildInferenceServiceOfferSpec_OperatorOverridesWin` pins that operator-supplied name/description/image/skills/domains all survive into the spec verbatim. - `TestBuildInferenceServiceOfferSpec_ModelNameNotHardcoded` pins that `spec.model.name` reflects `--model`, not the historical "ollama" literal. - `TestBuildActiveRegistrationDocument_KeepsOperatorDescription` pins the controller-side fix: an operator description survives the buildActiveRegistrationDocument pass. - `TestBuildActiveRegistrationDocument_FallsBackToModelDescriptionForInference` pins the other branch — inference offers with no operator description still get the model-aware default, not the generic one. Drive-by: `TestSellInference_Flags` was failing on `main` after #470 removed the `--price` default but didn't update the corresponding assertion. Updated to assert `--price` default = "" with a comment explaining the contract.

The first revision of #485 stopped at "produce a ServiceOffer with a populated spec.registration block". That made /.well-known/agent-registration.json get routed, but left the offer in Registered=False AwaitingExternalRegistration until someone manually ran `obol sell register`. The serviceoffer-controller's storefront filter (buildServiceCatalogJSON in render.go) requires Ready=True, which transitively requires Registered=True, so the offer was silently excluded from /api/services.json — the very feed the operator's own storefront UI consumes. obol sell http already auto-registers on the same code path. Mirror that here so the inference path reaches the same end state without a follow-up obol sell register step. Changes: - Extract shouldAutoRegisterSell(spec, tunnelURL) as the shared decision predicate. The same gate now drives both the http and inference auto-register call sites; defensively returns false when the registration block is missing, disabled, malformed, or the tunnel URL is empty. - sellInferenceCommand Action: after kubectlApply + EnsureTunnelForSell succeed, if shouldAutoRegisterSell says yes, call autoRegisterServiceOffer with the resolved name/description/wallet pulled from the spec. Surface failures as warnings + a re-run hint rather than aborting the gateway start, because the underlying call needs gas on the target chain and we do not want a one-off RPC hiccup to block local dev. Test coverage that would have caught the regression: - TestShouldAutoRegisterSell - table-driven over six scenarios including the defensive cases (registration not a map, registration.enabled not a bool). Both call sites use the helper. - TestSellInferenceAction_InvokesAutoRegister - source-level guard that scans the sellInferenceCommand body for shouldAutoRegisterSell and autoRegisterServiceOffer. The bug we just fixed was "Action calls neither"; an innocent refactor could remove the calls without any unit-level signal otherwise. Operational note: the agent's remote-signer wallet needs a small ETH balance on the target chain (~0.20-0.50 USD typical) for the on-chain register tx. If the wallet is unfunded the warning fires and the offer stays in AwaitingExternalRegistration; the operator can fund the wallet and re-run obol sell register to finish.

…straint) The autoRegisterServiceOffer pre-flight check rejected any registration whose signer didn't match the offer's payTo wallet: registration signer 0xA... does not match the payment wallet 0xB... Use a matching signer, omit --wallet so the remote-signer wallet is used, or pass --no-register The error wording read like an ERC-8004 limitation but isn't. ERC-8004 treats the agent OWNER (msg.sender at register time) and the agent WALLET (settable post-mint via setAgentWallet) as independent addresses. x402 settlement honors the offer's spec.payment.payTo directly — buyers pay that address regardless of what the registry's getAgentWallet returns. The "hot signer, cold/multisig payee" split is the canonical pattern. The historic guard existed because the obol CLI never exposed setAgentWallet, so a mismatched registration left operators with no in-CLI recovery path. This change instead surfaces the split as an informational note + adds `obol sell update <name> --pay-to <new>` as the recovery surface (already in tree; just needed test coverage and the connection wired into the diagnostic). Changes: - signerPayeeDelegationNote(signer, payTo) returns a human-readable note when the two diverge (case-insensitive, whitespace-tolerant, empty on either side) and "" otherwise. Used by autoRegisterServiceOffer instead of the previous early-return. - buildSellUpdatePatch(payTo, chain, price) extracted from the inline sellUpdateCommand Action so the patch shape — the thing that actually hits the cluster — is testable without a live offer. Action calls the helper instead of inlining the same logic. Tests: - TestSignerPayeeDelegationNote — 6-case table: match, case-insensitive, whitespace, empty payTo, empty signer (defensive), true mismatch (assertions name the addresses + advise sell update). - TestAutoRegister_AllowsSignerPayeeMismatch — source-level guard asserting the banned error wording is gone from autoRegisterServiceOffer and the soft-notice path is wired. Anyone re-introducing the check has to delete this test too, which forces them to read the rationale. - TestBuildSellUpdatePatch_PayToOnly — `obol sell update <name> --pay-to 0xBooB` builds a patch that touches only spec.payment.payTo, not network or price. - TestBuildSellUpdatePatch_PriceSwitchNullsOldKeys — table over perRequest/perMTok/perHour: the unused keys are explicitly null so a switch (e.g. perRequest → perMTok) doesn't leave the previous key fighting through merge semantics. - TestBuildSellUpdatePatch_NoFieldsErrors — error fires when no fields are set, and the error names the flags the operator should pass. - TestSellUpdate_PayToFlagSurface — `obol sell update` exposes --pay-to (with --wallet/--recipient/-w aliases via payToFlag), and --namespace is Required.

Closes the "cluster comes back but the seller offers don't" gap. After `obol stack down` destroys the k3d cluster, all ServiceOffer custom resources are wiped from etcd. The descriptors on disk at ~/.workspace/config/inference/<name>/ survive, but nothing replays them when the cluster comes back, so operators had to manually re-run `obol sell inference <name>` for every offer after every stack up. This wires a resume step into `obol stack up`: after stack.Up returns successfully, the action iterates inference.Store, rebuilds the Service+Endpoints+ServiceOffer for each persisted deployment from the on-disk descriptor, and kubectl-applies them. The foreground host gateway is NOT auto-started — it is an interactive operator action and stack-up shouldn't launch long-running processes — but the cluster side is fully reattached so the operator's eventual `obol sell inference <name>` rerun hits a "service already healthy" reconcile instead of "build from scratch." Storage model: - inference.Deployment gains ModelName, ServiceNamespace, and Registration fields, persisted with `omitempty` so legacy descriptors written by older binaries still load (the new fields come back as zero values; the resume path either defaults sensibly or refuses with an actionable message). - The inference Action now resolves the registration block once and passes the same map to both store.Create() and the in-process ServiceOffer apply — guaranteeing on-disk state matches what the cluster sees. Resume path: - resumeSellOffers(ctx, cfg, u): lists inference.Store, skips when no kubeconfig (no cluster yet), warns + continues per-offer on errors. - resumeOneInferenceOffer: createHostService + buildInferenceServiceOfferSpec + kubectlApply, no foreground process. Returns an actionable error on missing ModelName so legacy descriptors surface a clear "recreate the offer" prompt rather than producing a broken ServiceOffer. - Wired from cmd/obol/main.go's `stack up` Action after stack.Up. A resume failure is logged as a warning, not propagated — stack-up must succeed even if one descriptor is malformed. Tests: - TestStoreCreate_PersistsResumeFields: pins ModelName, ServiceNamespace, and Registration round-trip through JSON. Without this round-trip, the resume path silently loses operator customizations. - TestStoreCreate_LegacyDescriptorWithoutResumeFields: pins backwards-compatibility — a JSON written by an older binary still loads, with the new fields as zero values. - TestResumeSellOffers_EmptyStoreNoOp: empty store, resume returns nil. - TestResumeSellOffers_DescriptorPresentButNoCluster: descriptor on disk, kubeconfig absent (post-purge, pre-up state) — resume must skip cleanly. - TestResumeOneInferenceOffer_RequiresModelName: legacy descriptor with empty ModelName surfaces an actionable error naming the missing field and the recovery command. - TestResumeOneInferenceOffer_NilDescriptor: defensive nil/empty descriptor guard so one bad entry can't crash the loop. - TestStackUpAction_CallsResumeSellOffers: source-level guard that cmd/obol/main.go's `stack up` Action calls resumeSellOffers AFTER stack.Up. Pinning the order — running resume before stack.Up would see no kubeconfig and skip every offer silently. Operational note: `obol sell http` offers don't yet have an on-disk descriptor (only `obol sell inference` persists), so they aren't covered by this resume pass. Filed as a follow-up — adding a parallel manifest store for `sell http` plus a third resume branch is its own PR.

…ont filter The first cut of stack-up resume (prior commit) reattached cluster-side artifacts but stopped short of two things that together left the operator's storefront empty after every stack-up cycle: 1. The foreground x402 gateway never restarted, so UpstreamHealthy=False. 2. Even after a manual gateway restart, the controller's storefront filter required Ready=True — which itself requires Registered=True — and an unfunded agent wallet (or a deliberate "register later" choice) leaves Registered=False AwaitingExternalRegistration. The operationally-usable offer was invisible to the operator's own UI. Both addressed here so `obol stack up` is the only command needed. 1. Auto-start the gateway as a detached host subprocess: - startDetachedInferenceGateway reconstructs the original `obol sell inference <name> --model … --register-* …` invocation from the persisted inference.Deployment and spawns it with Setsid + Process.Release so it survives the parent's exit. - PID + log path live under <StateDir>/sell-inference/<name>/. readGatewayPID + processAlive let subsequent stack-ups detect a still-running gateway and skip the relaunch. - Surfaced via a u.Successf with the PID + log path so operators can `tail -f` or `kill $(cat .pid)` without parsing helpers. - sell_proc_unix.go isolates the platform-specific SysProcAttr behind the unix build tag, keeping the rest of the code portable. The long-term shape is to ship the inference gateway as a real helm-managed cluster Pod (it would survive stack-down/up like every other infra piece, and the controller could observe it as a first-class workload). Comment in startDetachedInferenceGateway spells that out so the next person looking at this knows the host subprocess + PID-file plumbing is the deliberate near-term step, not the final form. 2. Relax the storefront filter: - offerOperationallyReady = ModelReady + UpstreamHealthy + PaymentGateReady + RoutePublished. Registered is intentionally NOT in the AND — on-chain ERC-8004 registration is publication metadata, not operational readiness. Aggregate Ready=True is also accepted as a shortcut so existing test fixtures and any external callers that only emit the aggregate signal still work. - offerAwaitingRegistration flags Registered=False AwaitingExternalRegistration specifically, and buildServiceCatalogJSON propagates it as ServiceCatalogEntry.RegistrationPending=true so storefront UIs can badge "registration pending" alongside the usable offer. Tests: - TestBuildResumeGatewayArgs (3 cases): full descriptor incl. registration map, --no-register path, legacy descriptor with no Registration map. Pins positional-name-before-flags ordering so a CLI surface tweak can't silently desync the resume relaunch. - TestReadGatewayPID (8 cases): clean integer, trailing newline, surrounding whitespace, zero rejected, negative rejected, non-numeric rejected, empty rejected, missing-file rejected. Format is one decimal int in ASCII so external tooling (`kill $(cat .pid)`) works without parsing. - TestProcessAlive_SelfAndBogus: self is alive, absurd PID is not. - TestOfferOperationallyReady_IncludesAwaitingExternalRegistration: the headline behavioral fix — an offer with all four ops conditions True + Registered=False(AwaitingExternalRegistration) is operationally ready. - TestOfferOperationallyReady_RejectsRealNotReady: an offer with UpstreamHealthy=False is still excluded — the relax is narrow to the registration gate only. - TestBuildServiceCatalogJSON_IncludesPendingRegistrationOffers: end-to-end through /api/services.json — AwaitingExternalRegistration offers appear with RegistrationPending=true. - TestBuildServiceCatalogJSON_RegistrationPendingFalseForFullyReady: the negative — fully Ready=True offers must NOT carry RegistrationPending or the storefront badge would stick around on healthy offers. Operational footnote: the auto-spawned gateway picks up the same inference.Deployment that originally created the offer. If the descriptor predates this PR and has no ModelName, the resume per-offer warning fires (already pinned by TestResumeOneInferenceOffer_RequiresModelName in the prior commit) — recreate the offer once with the new code so the descriptor has the resume-side fields, after which every subsequent stack down/up cycle reattaches the offer with zero extra commands.

Surfaced during the live spark2 walkthrough of the prior commit: 1. Gateway PID printed as -1. cmd.Process.Release() on Unix sets p.Pid to -1 as part of its handle teardown, so reading cmd.Process.Pid AFTER Release prints a bogus -1 to the operator and pins -1 in the PID file. Snapshot the PID before the Release call. 2. Stale "Host gateways are not auto-started" footer. Carried over from the pre-auto-spawn version of the resume function. With the gateway now auto-spawned, the message is misleading. Replaced with a pointer to the gateway log path so operators can `tail -f` immediately. 3. /skill.md catalog used the strict Ready=True filter while /api/services.json used the new operationally-ready filter. The two surfaces should stay consistent — an offer that's usable for x402 payments shows up in BOTH the storefront feed and the operator- facing skill catalog. Switched buildSkillCatalogMarkdown to the same offerOperationallyReady predicate. No new tests; the existing TestOfferOperationallyReady_* / catalog tests cover the filter relax for both call sites since they share the predicate. The PID-reading order is a one-line semantic fix; the footer wording isn't worth a test.

Closes the second half of the "all paid services come back automatically after stack up" promise. The inference resume in earlier commits only handled `obol sell inference` because that's the only sell command whose state was persisted on disk. `obol sell http` always built its ServiceOffer manifest fresh from CLI flags and kubectl-applied it; no on-disk trace, so the resume loop had nothing to replay for the http case. On-disk schema: <ConfigDir>/sell-http/<namespace>__<name>.yaml One YAML file per offer, holding the rendered ServiceOffer manifest verbatim. The `<namespace>__<name>` filename ensures two offers with the same name in different namespaces never collide. We keep this separate from the inference store on purpose: inference descriptors carry host-side fields (ListenAddr, ModelName, Registration, …) that HTTP offers don't have, and folding them into one schema would force empty/optional fields onto every HTTP descriptor. A future unification is fine but not the right MVP shape. Lifecycle: - `obol sell http` (both --from-json and flag-driven paths) calls persistSellHTTPOffer right after kubectlApply succeeds. Persistence failures emit a warning but don't abort — the cluster-side offer is already in place, the disk state is the recovery aid, not the source of truth. - `obol sell delete` calls removeSellHTTPOffer so a stack-up after a delete doesn't resurrect the offer. Idempotent; missing-file removals are silent no-ops. - `obol stack up`'s resumeSellOffers calls resumeSellHTTPOffers after the inference branch. The early-return for "no inference offers" was rewritten to fall through to the http branch — operators with only HTTP offers and no inference offers must still get their storefront back. resumeSellHTTPOffers reads each YAML, unmarshals, and kubectl-applies. Per-offer errors are surfaced as warnings; the loop keeps going so one corrupt YAML can't block the rest. Tests: - TestPersistSellHTTPOffer_RoundTrip — pins the on-disk shape (path layout + YAML body content) by writing a manifest, reading it back, and asserting metadata.{name,namespace} + spec.payment.payTo survive. - TestPersistSellHTTPOffer_NamespaceIsolation — two offers with the same name in different namespaces produce distinct files. - TestRemoveSellHTTPOffer_DropsPersistedManifest — the symmetric teardown contract: file goes away on delete, idempotent on double- delete, defensive on empty inputs. - TestResumeSellHTTPOffers_EmptyStoreNoOp — no offers ever persisted must be a no-op, not an error. - TestSellDeleteAction_CallsRemoveSellHTTPOffer — source-level guard that the delete handler still calls the cleanup. Without it, an innocent refactor could leak persisted manifests and the only downstream symptom would be "deleted offers spookily come back" on the next stack-up — hard to attribute. - TestResumeSellOffers_HTTPOnlyStore — pins that the early-return rewrite works: an http-only workspace still gets its offers replayed. The sell-inference resume tests from the prior commits still cover their respective paths; the http-side additions are independent so neither store interferes with the other.

…05-13 hunt Three of the four "Remove once root cause identified" debug commits identified in plans/release-smoke-hardening-20260513.md (the fourth, internal/x402/verifier.go::HandleProxy, was already removed in 58dd89c as a CodeQL log-injection fix): - internal/x402/forwardauth.go: drop two log.Printf calls in facilitatorVerify (verify-body dump + facilitator response body dump). Both logged user-controlled bytes from X-PAYMENT — same log-injection risk class as the verifier one stripped earlier. - internal/x402/buyer/signer.go: drop five log.Printf calls in PreSignedSigner.CanSign (DENY network/payTo/asset/amount/exhausted). Logged untrusted PaymentRequirements values from seller 402 responses. Drop "log" import (was the sole user). - flows/flow-11-dual-stack.sh: keep the post-fail cluster-state diag capture (it's general failure-time operator artifact, useful for any future step-43 failure, not just the EnsureVerifier hunt) but rename the dir from "flow11-step43-debug" to "flow11-paid-fail-diag" and rephrase the comment to drop the now-stale "EnsureVerifier hunt" framing. Logs go to the same artifact dir; no behavior change. Validation: go test ./internal/x402/... ./cmd/obol/... -count=1 passes; bash -n on flows/flow-11-dual-stack.sh + flows/lib.sh passes. Plan doc plans/post-490-integration-20260513.md captures the full seven-phase work the user asked for.

Resolves a single conflict in cmd/obol/sell_test.go: keep #487's new explanatory comment block about why --price has no default after #470 (which is already on main as 4add2c8). The comment is correct and forward-looking; #487's flag-list additions for register-* parity applied cleanly alongside. Validation: go build ./... + go test ./cmd/obol/... ./internal/inference/... ./internal/serviceoffercontroller/... ./internal/schemas/... all green. Closes #487.

Three conflicts resolved, all of the same shape: HEAD (which already includes #487's storefront + registration-document changes) had new content where pr-489 had nothing at the same line. Took HEAD content in each case; pr-489 doesn't modify those sections. Conflicts resolved: - cmd/obol/sell_test.go: kept HEAD's --price-default explanatory comment block (#487 added it; references #470 already on main). - internal/serviceoffercontroller/render.go: kept HEAD's buildActiveRegistrationDocument/buildTombstoneRegistrationDocument/ buildRegistrationServices functions (#487 added them). - internal/serviceoffercontroller/render_test.go: kept HEAD's three new tests (Description-keep/fallback/IncludesOwner). #489 brings: AgentIdentity CRD (x402/default), identity_controller + identity_render pair, agentidentity-crd.yaml template, idempotent `obol sell register` per chain, `obol sell identity import` for adopting verified legacy agent IDs, monetizeapi.AgentIdentity type, embed_crd_test coverage of the new CRD. Validation: go build ./... clean; go test ./cmd/obol/... ./internal/x402/... ./internal/serviceoffercontroller/... ./internal/inference/... ./internal/schemas/... ./internal/erc8004/... ./internal/embed/... ./internal/monetizeapi/... ./internal/defaults/... all pass. Closes #489.

Single-stack Bob buyer that skips the Alice provisioning every other buy flow does. Targets third-party x402 sellers — primary use case is https://inference.v1337.org/services/aeon (the Phase 7 test of the post-490 integration plan). Parameterized via env vars (or matching CLI flags): - EXTERNAL_ENDPOINT e.g. https://inference.v1337.org/services/aeon - EXTERNAL_MODEL e.g. aeon-ultimate - EXTERNAL_TOKEN e.g. 0x0a09371a8b011d5110656ceBCc70603e53FD2c78 - EXTERNAL_CHAIN default base-sepolia - EXTERNAL_PAYTO e.g. 0xeFAb75b7b199bf8512e2d5b379374Cb94dfdBA47 - EXTERNAL_PRICE wei amount, e.g. 23000000000000000 (0.023 OBOL) - EXTERNAL_FACILITATOR default https://x402.gcp.obol.tech Step sequence: 1. Source flows/lib.sh (helpers + .env auto-load) and lib-dual-stack.sh (run_with_timeout, preseed_bob_wallet via bob()). 2. Derive Bob deterministically from .env REMOTE_SIGNER_PRIVATE_KEY. 3. Probe EXTERNAL_ENDPOINT without X-PAYMENT, parse 402 body, assert accepts[0] matches token/chain/price/payTo. Fail loudly with diff. 4. Confirm Bob ERC-20 balance >= EXTERNAL_PRICE; abort with the suggested cast send command if not (Alice is intentionally absent). 5. Bring up single Bob k3d stack under .workspace-bob-external/. 6. Invoke buy.py via 'obol kubectl exec' — bypasses ERC-8004 discovery (the seller's well-known doc lacks agentId). 7. Wait for PurchaseRequest Ready=True (5 min cap), port-forward LiteLLM, issue one paid request via paid/$EXTERNAL_MODEL, capture response + X-PAYMENT-RESPONSE. 8. Verify on-chain Transfer event from Bob to EXTERNAL_PAYTO, capture tx hash. 9. Write structured artifact dir with probe-402.json, purchaserequest.yaml, buyer-status-{before,after}.json, paid-response.json, settlement-tx.json, bob-balance-{before,after}.txt. Pipes shell output through scrub_secrets so paid-RPC URLs collapse to [REDACTED].<tld>/[REDACTED]. On PASS leaves the cluster up for inspection; only tears down on FAIL.

spark1 release-smoke RUN_ID=20260513-post490-1 failed at flow-13 step 49 with `expected remaining=5, got remaining=1`. Same bug class as the flow-14 fix in 6c52a67 last cycle: stale exact-count assertion. Two changes both mirror what flow-14 already does: 1. Prompt: change "Load the buy-x402 skill, then use your terminal tool" -> "Use the buy-x402 skill and your terminal tool". The "Load first, then use" framing seems to make the agent read buy-x402/SKILL.md and copy an example count value rather than honoring the explicit `--count 5` from the prompt. flow-14's wording does not exhibit this on the same model (qwen36-fast). 2. Assertion: change `remaining=5` exact match to `remaining>=5` programmatic check (poll for `remaining=[0-9]+`, then `[ "$n" -ge 5 ]`). Identical pattern to flow-14:997-1009. Allows controller-merged pools (e.g. remaining=10 on rerun) and is forgiving of one-off LLM stochasticity without losing the "buy actually provisioned" signal. Same commit on the same QA worktree had flow-11 and flow-14 PASS (both using `--count 5` against the live public facilitator), so the agent does honor count=5 most of the time; flow-13 is the unlucky draw on this run. Aligning the two flows removes the divergence as a confounder for future failures.

The knob existed as a workaround for the prom-overlay arm64 manifest shipping an amd64 binary. Upstream fix merged in ObolNetwork/x402-rs#3 (668b7bb) and 1.4.9 was republished: before: arm64 digest sha256:6b2198df... (amd64 ELF, crashloop) after: arm64 digest sha256:b209345c... (real aarch64 ELF) Validated end-to-end with three release-smoke runs on spark1 (post-490 QA worktree, OBOL_LLM_ENDPOINT routing through qwen36-fast on host vLLM): RUN_ID=20260513-post490-1 8fb42c9 12/13 PASS (flow-13 FAIL: stale assertion → fixed in d1f97e3) RUN_ID=20260513-post490-2-after-flow13-fix d1f97e3 12/13 PASS (flow-08 FAIL: transient cloudflared flake) RUN_ID=20260513-post490-3-no-skip-pull d1f97e3 13/13 PASS, RC=0 ← gate Run 3 exercised the no-SKIP_PULL path: flow-10 pulled the registry image fresh and brought up the facilitator container under qemu-emulated aarch64; flow-08 then completed the full anvil-fork buy roundtrip without falling back to the local-build path. Files touched: - flows/lib.sh: remove the if SKIP_PULL branch in x402_facilitator_image. - .agents/skills/obol-stack-dev/SKILL.md: hard-won-lessons row 9 rewritten to "fixed upstream" with the new arm64 digest. - .agents/skills/obol-stack-dev/references/release-smoke-debugging.md: entry #9 marked historical, new digest recorded, knob removal noted. - plans/release-smoke-hardening-20260513.md: retro entry #7 marked RESOLVED with merge commit + new digest; follow-up items #1 and #2 marked DONE. Nothing else needs the knob; QA hosts can drop the locally-built tag and let flow-10 pull from ghcr.io.

Sellers serving x402 v2 return network in CAIP-2 form (e.g. 'eip155:84532'). Operators commonly pass the legacy alias ('base-sepolia') on the CLI. The probe-step assertion compared them as strings and silently exited with no FAIL print under set -e+pipefail when they differed, because the python diff status non-zero combined with the EXIT trap swallowed the fail() output. Add a small CAIP-2 normalization map (covers mainnet, base, base-sepolia, sepolia, hoodi, polygon, optimism, arbitrum, avalanche) and apply it to both sides before comparing. Unknown chain ids pass through unchanged. Discovered while running the harness against https://inference.v1337.org which advertises 'eip155:84532'.

k3d enforces cluster name <= 32 chars including the 'obol-stack-' prefix (11 chars), so the user-supplied stack-id portion must be <= 21 chars. The default 'post490-buy-external-bob' (23 chars) → 'obol-stack-post490- buy-external-bob' (34 chars) tripped: FATA Failed Cluster Configuration Validation: provided cluster name 'obol-stack-post490-buy-external-bob' does not match requirements: Cluster name must be <= 32 characters, but has 35 Shorten the default to 'buy-ext-bob' (11 chars → 22 total) and add a comment so the next operator picking an EXTERNAL_STACK_ID stays under the cap. Discovered on the first real run against https://inference.v1337.org.

Sellers behind Cloudflare WAF blocking the default Python-urllib UA return HTTP 403 + error code 1010 instead of 402, which surfaces as "Unexpected HTTP 403 (expected 402). Body: error code: 1010" inside the agent pod and "Failed to get pricing. Aborting.". Add a module-level USER_AGENT constant defaulting to "obol-buy-x402/1.0 (+https://github.com/ObolNetwork/obol-stack)" and attach it as the User-Agent header on both the 402 probe (kind=http and kind=inference) and the paid X-PAYMENT request. The constant is overridable via OBOL_BUYER_USER_AGENT for sellers with stricter rules (e.g. that require a browser UA). Discovered while running flows/buy-external.sh against https://inference.v1337.org/services/aeon (post-490 integration branch). The harness's outer probe (step 2) passes CF because the default UA of the host-side fetch is allowed; the agent-pod buy.py call did not, because urllib's default UA is the one that's blocked.

Phase 7 of the post-#490 integration plan. Documents five attempts of the new flows/buy-external.sh harness against the production seller at https://inference.v1337.org/services/aeon as Bob (the deterministic 2nd-derived wallet from .env REMOTE_SIGNER_PRIVATE_KEY). Outcome: - Probe + 402 + accepts validation: green - Permit2 authorization signing: green (1 auth for 0.023 OBOL) - PurchaseRequest CR creation in hermes-obol-agent: green - Controller reconciliation: BLOCKED — observedGeneration never advanced past 0; kubectl exec session SIGKILLed (exit 137). Hypothesis: the serviceoffer-controller does not reconcile PRs whose endpoint does not match any in-cluster ServiceOffer.spec.upstream (external-seller mode is unimplemented). - Settlement: not attempted (gated on PR Ready). Bob OBOL balance unchanged across the test (4.978 OBOL). Surfaced and fixed four bugs along the way: - k3d 32-char cluster-name cap (7554b5e) - CAIP-2 vs legacy chain id mismatch in the harness probe (3d8e231) - Cloudflare WAF blocking Python-urllib UA in buy.py (c2dddc1) - Stale .build/obol vs freshly-rebuilt .workspace/bin/obol (operator- level fix, not committed code — follow-up #2 in the report) Includes follow-ups for the controller-side external-seller gap, the harness binary-path normalization, a CF-WAF UA note for the troubleshooting reference, and adding a KEEP_CLUSTER_ON_FAIL knob so the next external-seller failure has live diagnostic state to inspect.

…ot on FAIL When `flows/buy-external.sh` fails (typically at step 14, the `buy.py buy` invocation), the existing `external_cleanup` immediately tears the cluster down — destroying the only places that record why the PurchaseRequest never advanced (controller logs, PR `status.conditions[]`, sidecar `/status`). This commit: - Adds `external_snapshot_on_fail()` — best-effort capture of controller logs (current + `--previous`), PurchaseRequest YAML across all namespaces, buyer sidecar `/status` (via `kubectl exec ... python3` against the litellm container — buyer container is distroless), `cluster-pods.txt`, and recent `cluster-events.txt`. All commands wrapped in `|| true` so a single failure doesn't abort the bundle. Empty/failed files are removed. - Calls the snapshot from `external_cleanup` BEFORE any teardown, on the failure path only — clean exits keep the existing fast-cleanup behavior. - Honors `KEEP_CLUSTER_ON_FAIL=1` (default unset) — when set, skips `bob stack down` after the snapshot bundle is written and prints the preserved stack id + artifact dir + manual cleanup hint. Unblocks investigation of v1337-style external-seller failures documented in plans/inference-v1337-buy-report-20260514.md.

…otstrap `bootstrap_flow_workspace` previously copied unconditionally from the caller-supplied path (always `$OBOL_ROOT/.build/obol`). When iterating on embedded skill content (e.g. `internal/embed/skills/buy-x402/scripts/buy.py`) it's easy to rebuild one of the two binaries and forget the other, silently baking pre-fix files into the cluster PVC via `syncObolSkills`. Burned six hours during the v1337 live-buy investigation (attempt 5 in plans/inference-v1337-buy-report-20260514.md). Now: stat both paths, pick the one with the larger mtime, and emit a 5-line WARN to stderr when the two differ by more than 5 minutes — header + both paths-with-mtimes + which one was picked + a one-line rebuild nudge. Cross- OS stat handled via `stat -c %Y` with `stat -f %m` fallback. Date formatted with `date -r <file>` (BSD/macOS friendly), GNU `date -u -d "@<epoch>"` fallback. Contract preserved (no return value, copies into `$dir/bin/obol`).

Adds entry #10 to the release-smoke debugging reference covering the HTTP 403 + Cloudflare error 1010 we hit on v1337 attempts 3–4: managed WAF rules block the default `Python-urllib/X.Y` UA. Documents the buy.py fix (commit c2dddc1) plus the unconfirmed-but-likely Go-side follow-up at internal/serviceoffercontroller/purchase.go:183, where Go's `http.Client` defaults to `User-Agent: Go-http-client/1.1` and may hit the same WAF block on the controller probe.

Re-ran the v1337 buy with the new KEEP_CLUSTER_ON_FAIL=1 knob (commit b749f95). The controller reconciled the PurchaseRequest in 55 seconds through Probed → AuthsLoaded → Configured → Ready, against the same external endpoint the original report failed on. The original report's central technical claim — "serviceoffer-controller does not reconcile PurchaseRequests for external sellers" — is false. The controller is endpoint-agnostic by design (verified by code review of internal/serviceoffercontroller/purchase.go). Attempt 5's reconcile-hang was almost certainly a kubectl-exec session SIGKILL (exit 137), not a controller bug — likely harness-side run_with_timeout firing while buy.py was still polling normally. Today's run did surface a real but unrelated quirk: LiteLLM's POST /model/new fails with EROFS because /etc/litellm/config.yaml is mounted read-only as a Kubernetes ConfigMap volume; the controller catches this and falls back to ConfigMap reload, which works fine. Pre-existing, worth one line in paid-flows.md so the next debugger isn't startled. Step 18 (paid request) failed for an operator-error reason: I picked qwen3.6-27b as the upstream model id, but v1337's vLLM serves under a different name. Bob's 0.023 OBOL was NOT consumed (LiteLLM 404'd before the buyer sidecar could settle). Companion to plans/inference-v1337-buy-report-20260514.md. Retracts follow-up #1 of that report.

…_FAIL knob Replaces the opt-in KEEP_CLUSTER_ON_FAIL=1 env knob (added in b749f95) with an unconditional rule: cleanup happens iff every step passes. On FAIL, snapshot the diagnostic bundle and preserve the cluster — every time, no env override needed. Also inverts the prior success-side default. The previous design left the cluster up on success "so the operator can poke around"; in practice operators re-ran the harness from scratch when they wanted fresh state, and the leftover cluster mostly leaked across runs. With the new gate, a green run leaves a clean machine. Net behavior: - success → bob stack down (clean state for next run) - failure → snapshot + preserve (operator pays one manual teardown when done diagnosing) The diagnostic snapshot helper from b749f95 is unchanged; only the preservation gate moved from an env knob to the implicit pass/fail state.

chore(buy-external): preserve cluster on FAIL, normalize obol-bin path, document CF-WAF UA

bussyjd added 22 commits May 12, 2026 17:31

Persist ERC-8004 agent identity

e81dcad

This was referenced May 14, 2026

feat(stack): paid services survive stack down/up (sell-inference + sell-http resume + storefront fixes) #487

Closed

Persist ERC-8004 agent identity #489

Closed

bussyjd mentioned this pull request May 15, 2026

chore(buy-external): preserve cluster on FAIL, normalize obol-bin path, document CF-WAF UA #493

Merged

4 tasks

Merge pull request #493 from ObolNetwork/chore/buy-external-followups

2a1c1b2

chore(buy-external): preserve cluster on FAIL, normalize obol-bin path, document CF-WAF UA

bussyjd merged commit f086ffe into main May 15, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

post-490: fold #487 + #489, strip debug logs, drop SKIP_PULL (smoke 13/13)#492

post-490: fold #487 + #489, strip debug logs, drop SKIP_PULL (smoke 13/13)#492
bussyjd merged 24 commits into
mainfrom
integration/post-490-cleanups

bussyjd commented May 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bussyjd commented May 13, 2026

Summary

Scope

Validation

Live Chain Evidence

Runtime Evidence

Review Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant