chore(buy-external): preserve cluster on FAIL, normalize obol-bin path, document CF-WAF UA (#493)

Merged: bussyjd merged 5 commits into integration/post-490-cleanups from chore/buy-external-followups on May 15, 2026

Conversation


@bussyjd bussyjd commented May 14, 2026

Summary

Phase-1 polish + investigation follow-ups from plans/inference-v1337-buy-report-20260514.md. Net result: the report's central technical claim ("controller doesn't reconcile external-seller PurchaseRequests") is retracted — re-running with the new diagnostic gate proved the controller is endpoint-agnostic by design.

Stacks on #492 (folds in #487 + #489 + post-490 cleanups). Targets integration/post-490-cleanups so the diff shows only the 5 commits this PR adds.

What's in this branch

  • feat(buy-external): KEEP_CLUSTER_ON_FAIL knob + diagnostic snapshot on FAIL (b749f95) — adds external_snapshot_on_fail(). On FAIL, snapshots controller logs (current + --previous), PR YAML across all namespaces, buyer sidecar /status (via kubectl exec ... python3 against the litellm container — buyer is distroless), cluster-pods.txt, and recent cluster-events.txt to the artifact dir before any teardown.

  • fix(flows): pick freshest of .build/obol vs .workspace/bin/obol (eb13055) — bootstrap_flow_workspace in flows/lib.sh now stats both paths, picks the larger mtime, and emits a 5-line WARN when they differ by >5 minutes. Removes the silent-stale-binary footgun documented as v1337 attempt 5.

  • docs(skill): document Cloudflare-WAF UA pitfall (849cd93) — adds entry #10 in release-smoke-debugging.md covering HTTP 403 + Cloudflare error 1010 from the default Python-urllib UA, the c2dddc1 buy.py fix, and the (unconfirmed) follow-up about Go's http.Client defaults at purchase.go:183.

  • docs(plans): retract v1337 controller-gap hypothesis (82108c3) — plans/inference-v1337-followup-20260514.md companion to the original report. Re-run on spark1 showed the controller reconciles in 55s through Probed → AuthsLoaded → Configured → Ready. The Go-side probe was NOT WAF-blocked.

  • refactor(buy-external): green-only cleanup gate, drop KEEP_CLUSTER_ON_FAIL knob (df5fcff) — replaces the opt-in env knob with an unconditional rule: cleanup happens iff every step passes. Also inverts the prior success-side default (which left the cluster up on success "so the operator can poke around" — in practice operators re-ran from scratch and the leftover cluster mostly leaked).

Net cleanup behavior

| Run outcome | Behavior |
| --- | --- |
| Every step PASS | `bob stack down` — clean state for next run |
| Any step FAIL | Snapshot + preserve cluster; operator pays one manual `bob stack down` when done |

No env knob; the pass/fail exit code is the gate.

Side findings (worth knowing)

The captured controller log surfaces a pre-existing LiteLLM hot-add quirk:

purchase: hot-add paid/qwen3.6-27b failed: POST /model/new: 400 Bad Request:
{"error":{"message":"Authentication Error, [Errno 30] Read-only file system: '/etc/litellm/config.yaml'", ...}}; relying on ConfigMap reload

LiteLLM's /model/new API tries to write back to the ConfigMap volume (mounted read-only, as Kubernetes ConfigMap volumes are by default). The controller catches the 400 and falls back to the ConfigMap-reload path, which works. Not external-seller specific. Worth a one-liner in paid-flows.md so it stops surprising the next debugger.
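The catch-and-fallback described above lives in the controller's Go code; the shape of it can be sketched in shell, with the LiteLLM call mocked to always return the 400 (all function names here are illustrative, not the controller's actual symbols):

```shell
# Sketch of the controller's hot-add fallback. hot_add_model is a mock
# that simulates LiteLLM's 400 (EROFS on the read-only ConfigMap mount);
# reload_via_configmap stands in for the ConfigMap-reload path that works.

hot_add_model() {  # mock: POST /model/new against a read-only config volume
  echo 'POST /model/new: 400 Bad Request ([Errno 30] Read-only file system)' >&2
  return 1
}

reload_via_configmap() {  # fallback path; in the real flow this succeeds
  echo "relying on ConfigMap reload for $1"
}

publish_model() {
  local model="$1"
  # Try the hot-add first; on any failure, fall back instead of erroring out.
  if ! hot_add_model "$model" 2>/dev/null; then
    reload_via_configmap "$model"
  fi
}
```

The point is only that the 400 is expected and non-fatal: the fallback is the normal publish path when the config file is not writable.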

Test plan

  • bash -n flows/buy-external.sh && bash -n flows/lib.sh — clean
  • spark1 re-run on chore/buy-external-followups against https://inference.v1337.org/services/aeon:
    • 17 of 18 steps PASS
    • PR Ready=True after 55s, observedGeneration: 1, paid/qwen3.6-27b published, remaining: 1, spent: 0
    • Diagnostic gate preserved cluster on FAIL + 7 snapshot files written
    • Step 18 fails on operator-error model name (qwen3.6-27b ≠ v1337's actual model id); Bob's 0.023 OBOL pre-signed auth was NOT consumed
  • No regressions to existing flows (only lib.sh::bootstrap_flow_workspace touched among shared code; signature preserved, contract unchanged)
  • Reviewer: confirm the green-only cleanup gate is the right default for this single-operator QA harness (CI flows that need different behavior would call out separately)

bussyjd added 4 commits May 14, 2026 12:37
…ot on FAIL

When `flows/buy-external.sh` fails (typically at step 14, the `buy.py buy`
invocation), the existing `external_cleanup` immediately tears the cluster
down — destroying the only places that record why the PurchaseRequest never
advanced (controller logs, PR `status.conditions[]`, sidecar `/status`).

This commit:

- Adds `external_snapshot_on_fail()` — best-effort capture of controller
  logs (current + `--previous`), PurchaseRequest YAML across all namespaces,
  buyer sidecar `/status` (via `kubectl exec ... python3` against the
  litellm container — buyer container is distroless), `cluster-pods.txt`,
  and recent `cluster-events.txt`. All commands wrapped in `|| true` so a
  single failure doesn't abort the bundle. Empty/failed files are removed.

- Calls the snapshot from `external_cleanup` BEFORE any teardown, on the
  failure path only — clean exits keep the existing fast-cleanup behavior.

- Honors `KEEP_CLUSTER_ON_FAIL=1` (default unset) — when set, skips
  `bob stack down` after the snapshot bundle is written and prints the
  preserved stack id + artifact dir + manual cleanup hint.

Unblocks investigation of v1337-style external-seller failures documented
in plans/inference-v1337-buy-report-20260514.md.
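The best-effort capture pattern the commit describes can be sketched as below. The `snap` helper and `SNAP_DIR` naming are illustrative, not the exact shape in `flows/buy-external.sh`; the invariants are the ones the commit states: every probe is wrapped so it cannot abort the bundle, and empty or failed captures are pruned.

```shell
# Illustrative sketch of external_snapshot_on_fail(): each capture is
# best-effort (`|| true`), and files that end up empty are removed.

SNAP_DIR="${ARTIFACT_DIR:-./artifacts}/snapshot"

snap() {  # snap <outfile> <cmd...> — capture stdout+stderr, never fail the caller
  local out="$SNAP_DIR/$1"; shift
  "$@" >"$out" 2>&1 || true
  [ -s "$out" ] || rm -f "$out"   # drop empty/failed captures
}

external_snapshot_on_fail() {
  mkdir -p "$SNAP_DIR"
  snap controller.log        kubectl logs deploy/serviceoffer-controller
  snap controller-prev.log   kubectl logs deploy/serviceoffer-controller --previous
  snap purchaserequests.yaml kubectl get purchaserequests -A -o yaml
  snap cluster-pods.txt      kubectl get pods -A -o wide
  snap cluster-events.txt    kubectl get events -A --sort-by=.lastTimestamp
}
```

The sidecar `/status` capture (via `kubectl exec ... python3` against the litellm container) follows the same `snap` shape and is omitted here for brevity.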
…otstrap

`bootstrap_flow_workspace` previously copied unconditionally from the
caller-supplied path (always `$OBOL_ROOT/.build/obol`). When iterating on
embedded skill content (e.g. `internal/embed/skills/buy-x402/scripts/buy.py`)
it's easy to rebuild one of the two binaries and forget the other, silently
baking pre-fix files into the cluster PVC via `syncObolSkills`. Burned six
hours during the v1337 live-buy investigation (attempt 5 in
plans/inference-v1337-buy-report-20260514.md).

Now: stat both paths, pick the one with the larger mtime, and emit a 5-line
WARN to stderr when the two differ by more than 5 minutes — header + both
paths-with-mtimes + which one was picked + a one-line rebuild nudge. Cross-
OS stat handled via `stat -c %Y` with `stat -f %m` fallback. Date formatted
with `date -r <file>` (BSD/macOS friendly), GNU `date -u -d "@<epoch>"`
fallback. Contract preserved (no return value, copies into `$dir/bin/obol`).
Adds entry #10 to the release-smoke debugging reference covering the
HTTP 403 + Cloudflare error 1010 we hit on v1337 attempts 3–4: managed
WAF rules block the default `Python-urllib/X.Y` UA. Documents the buy.py
fix (commit c2dddc1) plus the unconfirmed-but-likely Go-side follow-up
at internal/serviceoffercontroller/purchase.go:183, where Go's
`http.Client` defaults to `User-Agent: Go-http-client/1.1` and may hit
the same WAF block on the controller probe.
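One way to keep this pitfall out of future flows is a small UA guard; the function below is a hypothetical helper (not in the repo) that rejects the two default UAs named in this entry. For a manual reproduction, `curl -A 'Python-urllib/3.11' <endpoint>` versus `curl -A 'obol-buy/1.0' <endpoint>` should show the 403 + error 1010 only for the former, assuming the managed WAF rule described here.

```shell
# Hypothetical guard: refuse to proceed with a User-Agent that matches
# the library defaults the Cloudflare managed rule is known to block.
ua_is_waf_safe() {
  case "$1" in
    Python-urllib/*|Go-http-client/*) return 1 ;;  # blocked defaults
    *) return 0 ;;
  esac
}
```

A flow script could call this on whatever UA it is about to send and fail fast with a pointer to entry #10 instead of a bare 403.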
Re-ran the v1337 buy with the new KEEP_CLUSTER_ON_FAIL=1 knob (commit
b749f95). The controller reconciled the PurchaseRequest in 55 seconds
through Probed → AuthsLoaded → Configured → Ready, against the same
external endpoint the original report failed on.

The original report's central technical claim — "serviceoffer-controller
does not reconcile PurchaseRequests for external sellers" — is false.
The controller is endpoint-agnostic by design (verified by code review of
internal/serviceoffercontroller/purchase.go). Attempt 5's reconcile-hang
was almost certainly a kubectl-exec session SIGKILL (exit 137), not a
controller bug — likely harness-side run_with_timeout firing while
buy.py was still polling normally.

Today's run did surface a real but unrelated quirk: LiteLLM's POST
/model/new fails with EROFS because /etc/litellm/config.yaml is mounted
read-only as a Kubernetes ConfigMap volume; the controller catches this
and falls back to ConfigMap reload, which works fine. Pre-existing,
worth one line in paid-flows.md so the next debugger isn't startled.

Step 18 (paid request) failed for an operator-error reason: I picked
qwen3.6-27b as the upstream model id, but v1337's vLLM serves under a
different name. Bob's 0.023 OBOL was NOT consumed (LiteLLM 404'd before
the buyer sidecar could settle).

Companion to plans/inference-v1337-buy-report-20260514.md. Retracts
follow-up #1 of that report.
@bussyjd bussyjd changed the base branch from main to integration/post-490-cleanups May 14, 2026 07:17
…_FAIL knob

Replaces the opt-in KEEP_CLUSTER_ON_FAIL=1 env knob (added in b749f95)
with an unconditional rule: cleanup happens iff every step passes. On
FAIL, snapshot the diagnostic bundle and preserve the cluster — every
time, no env override needed.

Also inverts the prior success-side default. The previous design left
the cluster up on success "so the operator can poke around"; in
practice operators re-ran the harness from scratch when they wanted
fresh state, and the leftover cluster mostly leaked across runs. With
the new gate, a green run leaves a clean machine.

Net behavior:
- success → bob stack down (clean state for next run)
- failure → snapshot + preserve (operator pays one manual teardown
            when done diagnosing)

The diagnostic snapshot helper from b749f95 is unchanged; only the
preservation gate moved from an env knob to the implicit pass/fail
state.
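The gate the commit describes reduces to an EXIT trap keyed off a pass/fail flag instead of an env knob. This is a sketch under that reading, with `bob stack down` and the snapshot helper stubbed as echoes (the real script's step runner and cleanup are more involved):

```shell
# Green-only cleanup gate: the implicit pass/fail state is the only switch.

FLOW_FAILED=0

run_step() {  # mark the flow failed on the first failing step
  "$@" || FLOW_FAILED=1
}

flow_cleanup() {
  if [ "$FLOW_FAILED" -eq 0 ]; then
    echo "all steps green: bob stack down"      # real flow: bob stack down
  else
    echo "FAIL: snapshot + preserve cluster"    # real flow: external_snapshot_on_fail, no teardown
  fi
}
trap flow_cleanup EXIT
```

No environment variable to remember either way: a green run leaves a clean machine, a red run leaves evidence.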
@bussyjd bussyjd changed the title chore(buy-external): add KEEP_CLUSTER_ON_FAIL knob, normalize obol-bin path, document CF-WAF UA chore(buy-external): preserve cluster on FAIL, normalize obol-bin path, document CF-WAF UA May 15, 2026
@bussyjd bussyjd merged commit 2a1c1b2 into integration/post-490-cleanups May 15, 2026