chore(buy-external): preserve cluster on FAIL, normalize obol-bin path, document CF-WAF UA (#493)

Merged: bussyjd merged 5 commits into integration/post-490-cleanups from chore/buy-external-followups on May 15, 2026

Conversation


@bussyjd bussyjd commented May 14, 2026

Summary

Phase-1 polish + investigation follow-ups from plans/inference-v1337-buy-report-20260514.md. Net result: the report's central technical claim ("controller doesn't reconcile external-seller PurchaseRequests") is retracted — re-running with the new diagnostic gate proved the controller is endpoint-agnostic by design.

Stacks on #492 (folds in #487 + #489 + post-490 cleanups). Targets integration/post-490-cleanups so the diff shows only the 5 commits this PR adds.

What's in this branch

  • feat(buy-external): KEEP_CLUSTER_ON_FAIL knob + diagnostic snapshot on FAIL (b749f95) — adds external_snapshot_on_fail(). On FAIL, snapshots controller logs (current + --previous), PR YAML across all namespaces, buyer sidecar /status (via kubectl exec ... python3 against the litellm container — buyer is distroless), cluster-pods.txt, and recent cluster-events.txt to the artifact dir before any teardown.

  • fix(flows): pick freshest of .build/obol vs .workspace/bin/obol (eb13055) — bootstrap_flow_workspace in flows/lib.sh now stats both paths, picks the larger mtime, and emits a 5-line WARN when they differ by >5 minutes. Removes the silent-stale-binary footgun documented as v1337 attempt 5.

  • docs(skill): document Cloudflare-WAF UA pitfall (849cd93) — adds entry #10 in release-smoke-debugging.md covering HTTP 403 + Cloudflare error 1010 from the default Python-urllib UA, the c2dddc1 buy.py fix, and the (unconfirmed) follow-up about Go's http.Client defaults at purchase.go:183.

  • docs(plans): retract v1337 controller-gap hypothesis (82108c3) — plans/inference-v1337-followup-20260514.md companion to the original report. Re-run on spark1 showed the controller reconciles in 55s through Probed → AuthsLoaded → Configured → Ready. The Go-side probe was NOT WAF-blocked.

  • refactor(buy-external): green-only cleanup gate, drop KEEP_CLUSTER_ON_FAIL knob (df5fcff) — replaces the opt-in env knob with an unconditional rule: cleanup happens iff every step passes. Also inverts the prior success-side default (which left the cluster up on success "so the operator can poke around" — in practice operators re-ran from scratch and the leftover cluster mostly leaked).

Net cleanup behavior

| Run outcome | Behavior |
| --- | --- |
| Every step PASS | `bob stack down` — clean state for next run |
| Any step FAIL | Snapshot + preserve cluster; operator pays one manual `bob stack down` when done |

No env knob; the pass/fail exit code is the gate.

Side findings (worth knowing)

The captured controller log surfaces a pre-existing LiteLLM hot-add quirk:

purchase: hot-add paid/qwen3.6-27b failed: POST /model/new: 400 Bad Request:
{"error":{"message":"Authentication Error, [Errno 30] Read-only file system: '/etc/litellm/config.yaml'", ...}}; relying on ConfigMap reload

LiteLLM's /model/new API tries to write back to the ConfigMap volume (mounted read-only, as Kubernetes ConfigMap volumes are by default). The controller catches the 400 and falls back to the ConfigMap-reload path, which works. Not external-seller specific. Worth a one-liner in paid-flows.md so it stops surprising the next debugger.
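The catch-and-fallback described above lives in the controller's Go code; the shape of it can be sketched in shell, with the LiteLLM call mocked to always return the 400 (all function names here are illustrative, not the controller's actual symbols):

```shell
# Sketch of the controller's hot-add fallback. hot_add_model is a mock
# that simulates LiteLLM's 400 (EROFS on the read-only ConfigMap mount);
# reload_via_configmap stands in for the ConfigMap-reload path that works.

hot_add_model() {  # mock: POST /model/new against a read-only config volume
  echo 'POST /model/new: 400 Bad Request ([Errno 30] Read-only file system)' >&2
  return 1
}

reload_via_configmap() {  # fallback path; in the real flow this succeeds
  echo "relying on ConfigMap reload for $1"
}

publish_model() {
  local model="$1"
  # Try the hot-add first; on any failure, fall back instead of erroring out.
  if ! hot_add_model "$model" 2>/dev/null; then
    reload_via_configmap "$model"
  fi
}
```

The point is only that the 400 is expected and non-fatal: the fallback is the normal publish path when the config file is not writable.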

Test plan

  • bash -n flows/buy-external.sh && bash -n flows/lib.sh — clean
  • spark1 re-run on chore/buy-external-followups against https://inference.v1337.org/services/aeon:
    • 17 of 18 steps PASS
    • PR Ready=True after 55s, observedGeneration: 1, paid/qwen3.6-27b published, remaining: 1, spent: 0
    • Diagnostic gate preserved cluster on FAIL + 7 snapshot files written
    • Step 18 fails on operator-error model name (qwen3.6-27b ≠ v1337's actual model id); Bob's 0.023 OBOL pre-signed auth was NOT consumed
  • No regressions to existing flows (only lib.sh::bootstrap_flow_workspace touched among shared code; signature preserved, contract unchanged)
  • Reviewer: confirm the green-only cleanup gate is the right default for this single-operator QA harness (CI flows that need different behavior would call out separately)

bussyjd added 4 commits May 14, 2026 12:37
…ot on FAIL

When `flows/buy-external.sh` fails (typically at step 14, the `buy.py buy`
invocation), the existing `external_cleanup` immediately tears the cluster
down — destroying the only places that record why the PurchaseRequest never
advanced (controller logs, PR `status.conditions[]`, sidecar `/status`).

This commit:

- Adds `external_snapshot_on_fail()` — best-effort capture of controller
  logs (current + `--previous`), PurchaseRequest YAML across all namespaces,
  buyer sidecar `/status` (via `kubectl exec ... python3` against the
  litellm container — buyer container is distroless), `cluster-pods.txt`,
  and recent `cluster-events.txt`. All commands wrapped in `|| true` so a
  single failure doesn't abort the bundle. Empty/failed files are removed.

- Calls the snapshot from `external_cleanup` BEFORE any teardown, on the
  failure path only — clean exits keep the existing fast-cleanup behavior.

- Honors `KEEP_CLUSTER_ON_FAIL=1` (default unset) — when set, skips
  `bob stack down` after the snapshot bundle is written and prints the
  preserved stack id + artifact dir + manual cleanup hint.

Unblocks investigation of v1337-style external-seller failures documented
in plans/inference-v1337-buy-report-20260514.md.
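The best-effort capture pattern the commit describes can be sketched as below. The `snap` helper and `SNAP_DIR` naming are illustrative, not the exact shape in `flows/buy-external.sh`; the invariants are the ones the commit states: every probe is wrapped so it cannot abort the bundle, and empty or failed captures are pruned.

```shell
# Illustrative sketch of external_snapshot_on_fail(): each capture is
# best-effort (`|| true`), and files that end up empty are removed.

SNAP_DIR="${ARTIFACT_DIR:-./artifacts}/snapshot"

snap() {  # snap <outfile> <cmd...> — capture stdout+stderr, never fail the caller
  local out="$SNAP_DIR/$1"; shift
  "$@" >"$out" 2>&1 || true
  [ -s "$out" ] || rm -f "$out"   # drop empty/failed captures
}

external_snapshot_on_fail() {
  mkdir -p "$SNAP_DIR"
  snap controller.log        kubectl logs deploy/serviceoffer-controller
  snap controller-prev.log   kubectl logs deploy/serviceoffer-controller --previous
  snap purchaserequests.yaml kubectl get purchaserequests -A -o yaml
  snap cluster-pods.txt      kubectl get pods -A -o wide
  snap cluster-events.txt    kubectl get events -A --sort-by=.lastTimestamp
}
```

The sidecar `/status` capture (via `kubectl exec ... python3` against the litellm container) follows the same `snap` shape and is omitted here for brevity.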
…otstrap

`bootstrap_flow_workspace` previously copied unconditionally from the
caller-supplied path (always `$OBOL_ROOT/.build/obol`). When iterating on
embedded skill content (e.g. `internal/embed/skills/buy-x402/scripts/buy.py`)
it's easy to rebuild one of the two binaries and forget the other, silently
baking pre-fix files into the cluster PVC via `syncObolSkills`. Burned six
hours during the v1337 live-buy investigation (attempt 5 in
plans/inference-v1337-buy-report-20260514.md).

Now: stat both paths, pick the one with the larger mtime, and emit a 5-line
WARN to stderr when the two differ by more than 5 minutes — header + both
paths-with-mtimes + which one was picked + a one-line rebuild nudge. Cross-
OS stat handled via `stat -c %Y` with `stat -f %m` fallback. Date formatted
with `date -r <file>` (BSD/macOS friendly), GNU `date -u -d "@<epoch>"`
fallback. Contract preserved (no return value, copies into `$dir/bin/obol`).
Adds entry #10 to the release-smoke debugging reference covering the
HTTP 403 + Cloudflare error 1010 we hit on v1337 attempts 3–4: managed
WAF rules block the default `Python-urllib/X.Y` UA. Documents the buy.py
fix (commit c2dddc1) plus the unconfirmed-but-likely Go-side follow-up
at internal/serviceoffercontroller/purchase.go:183, where Go's
`http.Client` defaults to `User-Agent: Go-http-client/1.1` and may hit
the same WAF block on the controller probe.
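One way to keep this pitfall out of future flows is a small UA guard; the function below is a hypothetical helper (not in the repo) that rejects the two default UAs named in this entry. For a manual reproduction, `curl -A 'Python-urllib/3.11' <endpoint>` versus `curl -A 'obol-buy/1.0' <endpoint>` should show the 403 + error 1010 only for the former, assuming the managed WAF rule described here.

```shell
# Hypothetical guard: refuse to proceed with a User-Agent that matches
# the library defaults the Cloudflare managed rule is known to block.
ua_is_waf_safe() {
  case "$1" in
    Python-urllib/*|Go-http-client/*) return 1 ;;  # blocked defaults
    *) return 0 ;;
  esac
}
```

A flow script could call this on whatever UA it is about to send and fail fast with a pointer to entry #10 instead of a bare 403.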
Re-ran the v1337 buy with the new KEEP_CLUSTER_ON_FAIL=1 knob (commit
b749f95). The controller reconciled the PurchaseRequest in 55 seconds
through Probed → AuthsLoaded → Configured → Ready, against the same
external endpoint the original report failed on.

The original report's central technical claim — "serviceoffer-controller
does not reconcile PurchaseRequests for external sellers" — is false.
The controller is endpoint-agnostic by design (verified by code review of
internal/serviceoffercontroller/purchase.go). Attempt 5's reconcile-hang
was almost certainly a kubectl-exec session SIGKILL (exit 137), not a
controller bug — likely harness-side run_with_timeout firing while
buy.py was still polling normally.

Today's run did surface a real but unrelated quirk: LiteLLM's POST
/model/new fails with EROFS because /etc/litellm/config.yaml is mounted
read-only as a Kubernetes ConfigMap volume; the controller catches this
and falls back to ConfigMap reload, which works fine. Pre-existing,
worth one line in paid-flows.md so the next debugger isn't startled.

Step 18 (paid request) failed for an operator-error reason: I picked
qwen3.6-27b as the upstream model id, but v1337's vLLM serves under a
different name. Bob's 0.023 OBOL was NOT consumed (LiteLLM 404'd before
the buyer sidecar could settle).

Companion to plans/inference-v1337-buy-report-20260514.md. Retracts
follow-up #1 of that report.
@bussyjd bussyjd changed the base branch from main to integration/post-490-cleanups May 14, 2026 07:17
…_FAIL knob

Replaces the opt-in KEEP_CLUSTER_ON_FAIL=1 env knob (added in b749f95)
with an unconditional rule: cleanup happens iff every step passes. On
FAIL, snapshot the diagnostic bundle and preserve the cluster — every
time, no env override needed.

Also inverts the prior success-side default. The previous design left
the cluster up on success "so the operator can poke around"; in
practice operators re-ran the harness from scratch when they wanted
fresh state, and the leftover cluster mostly leaked across runs. With
the new gate, a green run leaves a clean machine.

Net behavior:
- success → bob stack down (clean state for next run)
- failure → snapshot + preserve (operator pays one manual teardown
            when done diagnosing)

The diagnostic snapshot helper from b749f95 is unchanged; only the
preservation gate moved from an env knob to the implicit pass/fail
state.
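The gate the commit describes reduces to an EXIT trap keyed off a pass/fail flag instead of an env knob. This is a sketch under that reading, with `bob stack down` and the snapshot helper stubbed as echoes (the real script's step runner and cleanup are more involved):

```shell
# Green-only cleanup gate: the implicit pass/fail state is the only switch.

FLOW_FAILED=0

run_step() {  # mark the flow failed on the first failing step
  "$@" || FLOW_FAILED=1
}

flow_cleanup() {
  if [ "$FLOW_FAILED" -eq 0 ]; then
    echo "all steps green: bob stack down"      # real flow: bob stack down
  else
    echo "FAIL: snapshot + preserve cluster"    # real flow: external_snapshot_on_fail, no teardown
  fi
}
trap flow_cleanup EXIT
```

No environment variable to remember either way: a green run leaves a clean machine, a red run leaves evidence.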
@bussyjd bussyjd changed the title chore(buy-external): add KEEP_CLUSTER_ON_FAIL knob, normalize obol-bin path, document CF-WAF UA chore(buy-external): preserve cluster on FAIL, normalize obol-bin path, document CF-WAF UA May 15, 2026
@bussyjd bussyjd merged commit 2a1c1b2 into integration/post-490-cleanups May 15, 2026