Skip to content

Review giant monetise pr#316

Merged
OisinKyne merged 3 commits into
feat/monetize-pathfrom
oisin/monetize-overhaul
Apr 7, 2026
Merged

Review giant monetise pr#316
OisinKyne merged 3 commits into
feat/monetize-pathfrom
oisin/monetize-overhaul

Conversation

@OisinKyne
Copy link
Copy Markdown
Contributor

I removed the indexer stuff because claude thinks it wasn't used.

I renamed probe to test.

I asked claude to find issues and let it make almost all of them.

I include docs/pr-review-monetize-path.md so you can follow my journey on this PR review in a scalable manner (and to highlight anywhere claude was wrong and that we need to unscrew up).

I let it remove the ralph files because i think you have the other PR with all the flows and gherkin stuff which has superseeded your ralph loop if i'm not mistaken.

If you're happy, lets get this merged to yours, into main, and also merge your draft PR that improves logging etc.

@OisinKyne OisinKyne changed the title Remove indexer that's supposedly not used. Will reintroduce as part o… Review giant monetise pr Apr 7, 2026
@OisinKyne OisinKyne merged commit d259ae7 into feat/monetize-path Apr 7, 2026
@OisinKyne OisinKyne deleted the oisin/monetize-overhaul branch April 7, 2026 19:42
OisinKyne added a commit that referenced this pull request Apr 7, 2026
…exing (M3) (#288)

* feat: monetize-guide skill + sell probe command (#286)

Add monetize-guide skill (SKILL.md + seller-prompt reference) that teaches
Claude Code the end-to-end flow: pre-flight checks, model detection,
pricing research via ERC-8004 registry, user-confirmed pricing, sell
execution, reconciliation monitoring, and endpoint verification.

Add `obol sell probe <name> -n <ns>` command that hits the public tunnel
URL and verifies the endpoint returns 402 with valid x402 pricing headers.
Closes the feedback loop so Claude can confirm a service is live.

* fix: approximate per-hour pricing to per-request for x402 gating

--per-hour was passing the raw hourly price as the per-request charge
(e.g., $0.50/hour charged $0.50 per HTTP request). Now approximates
using a 5-minute experiment budget: perRequest = perHour * 5/60.

Also rewrites worker_api.py to use stdlib http.server (no Flask dep).

* feat: add reth-based ERC-8004 indexer (from #269)

* fix: address PR 265 review findings — security, correctness, test coverage

Critical:
- Fix path traversal in worker_api.py: validate experiment_id with regex
- Add GGUF format guard in publish.py before ollama_create()
- Run worker container as non-root user in Dockerfile

Medium:
- Replace manual provenance struct-to-map with JSON round-trip in sell.go
- Fix weak test assertion in TestApproximateRequestPriceFromPerHour
- Guard int(amount) cast in coordinate.py with try/except
- Remove domain-specific default "val_bpb" from CRD metricName field
- Guard maxTimeoutSeconds parse in monetize.py

Low:
- Add provenance propagation test for build_registration_doc
- Fix doc type inconsistency in coordination-protocol.md
- Assert all 6 Provenance fields in store_test.go

# Conflicts:
#	cmd/obol/sell.go
#	internal/embed/infrastructure/base/templates/serviceoffer-crd.yaml
#	internal/embed/skills/autoresearch-coordinator/references/coordination-protocol.md
#	internal/embed/skills/autoresearch-coordinator/scripts/coordinate.py
#	internal/embed/skills/autoresearch/SKILL.md
#	internal/embed/skills/autoresearch/scripts/publish.py
#	internal/embed/skills/sell/scripts/monetize.py
#	internal/inference/store_test.go
#	tests/test_sell_registration_metadata.py

* fix(m1): add sell probe command and flow validation scripts

- Add `obol sell probe <name> -n <ns>` — sends unauthenticated request
  through Traefik to verify the endpoint returns 402 with x402 pricing.
- Create flow-06-sell-setup.sh: idempotent ServiceOffer creation
- Create flow-07-sell-verify.sh: wait for reconciliation + verify conditions
- Create flow-10-anvil-facilitator.sh: Anvil fork + x402-rs facilitator
- Create flow-08-buy.sh: EIP-712 sign + paid request through Traefik

All flows pass end-to-end: sell → verify → pay → inference (HTTP 200).

* fix(m2): CRD skills/domains fields + coordinator x402 V1 parsing

- Add skills and domains array fields to ServiceOffer CRD registration
  schema. The --register-skills CLI flag was rejected by strict CRD
  validation because the fields were missing.
- Fix coordinator.py parse_402_pricing to handle x402 V1 standard
  format (accepts[] array) in addition to flat top-level fields.

Validated: worker ServiceOffer → 402 gate → .well-known → discovery →
coordinator probe.

* fix(oasf): align skill/domain taxonomy with official OASF schema

Paths validated against https://github.com/agntcy/oasf (cloned at R&D/oasf/).

Skills:
- natural_language_processing/text_generation/chat_completion → natural_language_processing/natural_language_generation/text_completion
- machine_learning/model_optimization → analytical_skills/model_optimization

Domains:
- technology/artificial_intelligence → technology/data_science
- technology/artificial_intelligence/research → research_and_development/scientific_research

Fixed in: monetize.py, coordinate.py, monetize-guide SKILL.md, autoresearch SKILL.md,
autoresearch-worker SKILL.md, coordinator references, seller-prompt.md

* fix(oasf): use real schema paths — devops_mlops/model_versioning

analytical_skills/model_optimization was invented (doesn't exist in OASF).
Closest real path: devops_mlops/model_versioning (validated against
agntcy/oasf schema repo at R&D/oasf/).

Also confirms prior fixes:
- natural_language_processing/natural_language_generation/text_completion
- technology/data_science
- research_and_development/scientific_research

* fix: stop stripping /v1 from custom endpoint URLs

LiteLLM does NOT auto-append /v1 for openai/ provider routes.
WarnAndStripV1Suffix was removing /v1 from api_base before storing
in the ConfigMap, causing LiteLLM to hit /chat/completions (404)
instead of /v1/chat/completions.

Verified against LiteLLM source (gpt_transformation.py:get_complete_url):
  api_base is passed as-is to the OpenAI SDK base_url, which appends
  /chat/completions directly. The /v1 must be in api_base.

Removed from all 3 call sites:
- cmd/obol/model.go (obol model setup custom)
- internal/openclaw/openclaw.go (promptForDirectProvider)
- internal/openclaw/openclaw.go (promptForCustomProvider)

* docs: add LAN resource selling path to monetize-guide

Adds a new "External LAN Resources" section to the monetize-guide SKILL.md
covering the flow for selling GPU servers or inference endpoints on the
local network (e.g., DGX Spark running vLLM).

The path: obol model setup custom (bridge into LiteLLM) → obol sell http
(create ServiceOffer pointing at LiteLLM). Documents that LAN IPs are
reachable from k3d without additional config, and that --endpoint must
include /v1 since LiteLLM does not auto-append it.

Validated end-to-end with Nemotron 3 Super 120B on 2x DGX Spark:
- obol model setup custom validates and adds to LiteLLM
- obol sell http creates ServiceOffer
- Agent heartbeat reconciles in ~90s → all 6 conditions True
- 402 gating works locally and through Cloudflare tunnel
- .well-known and /skill.md discovery updated automatically

* fix: restore tunnel open/close and host-service logic dropped during cherry-pick

Cherry-picking PR #265 into feat/monetize-path had conflicts in sell.go
that dropped 3 blocks of code from main:

1. obol sell inference — cluster-aware routing: detects k3d cluster,
   creates K8s Service+Endpoints bridge to host gateway, creates
   ServiceOffer, auto-starts tunnel via EnsureTunnelForSell()

2. obol sell http — auto-tunnel: calls EnsureTunnelForSell() after
   creating a ServiceOffer so the endpoint is immediately public

3. obol sell delete — auto-stop tunnel: when the last ServiceOffer is
   deleted, stops the quick tunnel and removes the storefront

Also restores:
- NoPaymentGate field on Deployment and GatewayConfig structs
- createHostService(), resolveHostIP(), buildInferenceServiceOfferSpec()
- net, runtime, strconv, stack imports

* security: pin LiteLLM image to v1.82.3 — supply chain compromise

LiteLLM PyPI packages 1.82.7 and 1.82.8 contain a malicious .pth file
(litellm_init.pth) that exfiltrates environment variables, SSH keys,
cloud credentials, and Kubernetes configs to an external endpoint.

See: BerriAI/litellm#24512

Our template used the floating tag `main-stable` which could pull a
compromised build. Pin to `main-v1.82.3` (confirmed safe, matches
the version currently running in our clusters).

Never use floating tags for security-sensitive dependencies.

* fix: restore /skill.md publishing dropped during cherry-pick

The cherry-pick of PR #265 dropped _build_skill_md() and
_publish_skill_md() from monetize.py, along with their 3 call sites
in cmd_process. This meant /skill.md would never be created or updated
on a fresh cluster.

Restores:
- _build_skill_md(): generates service catalog markdown from Ready offers
- _publish_skill_md(): creates/updates ConfigMap + Deployment + Service +
  HTTPRoute for the /skill.md endpoint
- 3 call sites in cmd_process:
  1. Empty skill.md when no offers exist
  2. Full skill.md when all offers are Ready
  3. Regenerate after reconciliation loop

* test: add tunnel lifecycle test coverage

No tests existed for tunnel state persistence or auto-stop decision
logic — this is why the cherry-pick drift went undetected.

New tests in tunnel_lifecycle_test.go:
- State round-trip (save/load, quick & dns modes)
- Missing state file returns (nil, nil)
- State overwrite replaces previous
- File permissions (0600 for non-secret metadata)
- UpdatedAt timestamp refresh on save
- tunnelModeAndURL derivation
- shouldAutoStopTunnel decision logic (5 cases covering the logic
  from sell delete: stop quick tunnels when empty, never stop dns)
- Exported LoadTunnelState wrapper

* chore: add golangci-lint v2 config with gofumpt formatting (#290)

* chore: update CLAUDE.md conventions and fix stale justfile commands

- Add Conventions section (conventional commits, branch prefixes, skill reference)
- Trim discoverable sections (Key Packages table, file-path tables, Embedded assets)
- Keep full LLM Routing, Security, Pitfalls sections
- Fix justfile: obol cluster → obol stack (stale command name)

* chore: add golangci-lint v2 config and PostToolUse format hook

Aligned with ObolNetwork/charon lint stance, adapted for obol-stack:
- Disabled linters not applicable to CLI apps (wrapcheck, forbidigo, noctx, testpackage)
- Enabled gofumpt + goimports formatters
- Added PostToolUse hook to auto-lint Go files on Write|Edit

Many linters are disabled with TODO comments for incremental re-enablement.

* style: apply gofumpt and goimports formatting across codebase

Auto-fix from golangci-lint v2 with gofumpt + goimports formatters.
Also removes unused code: sellInfoCommand (cmd/obol/sell.go),
pubKeyBytes (internal/tee/attest_stub.go), and fixes dogsled/thelper/govet
warnings.

* security: re-enable gosec linter and fix all issues

- Tighten WriteFile permissions from 0644 to 0600 across 13 files
  (config, keys, kubeconfig — no reason for group/world read on local CLI)
- Exclude G204 (subprocess) and G101 (credential naming) — systemic
  false positives for a CLI tool that intentionally runs k3d/helm/kubectl
- Exclude gosec from internal/testutil/ (test-only helpers)
- Nolint G704 (SSRF), G705 (XSS), G122 (TOCTOU), G703 (path traversal)
  with explanations — all false positives for local config operations

* chore: re-enable 10 linters and fix all issues

Re-enabled: errcheck, errchkjson, gocritic, nilerr, nilnil,
nosprintfhostport, staticcheck, unconvert, wastedassign.

Fixes:
- errcheck: add _ = for best-effort calls (browser open, process kill,
  kubectl display, io.Copy to discard)
- nilerr: annotate intentional fallback patterns (missing config files,
  optional features, backward compat defaults)
- staticcheck: lowercase error strings (ST1005), rename hasId→hasID
  (ST1003), nolint deprecated elliptic.Marshal (SA1019)
- gocritic: fix appendAssign slice mutation bugs (copy before append),
  nolint CGo dupSubExpr false positives, exitAfterDefer in main
- unconvert: remove 5 unnecessary int() wrappers on cmd.Int()
- nosprintfhostport: use net.JoinHostPort in testutil
- wastedassign: use var declaration instead of empty init

Still disabled: goconst (32), unparam (28) — low priority.

* chore: re-enable goconst and unparam — all linters now active

- Extract provider name constants (ProviderOllama, ProviderAnthropic,
  ProviderOpenAI) into internal/model and use across model/openclaw
- Extract API key env var constants (envAnthropicAPIKey, envOpenAIAPIKey)
- Extract OS name constants in dns/resolver, update status messages in
  update package, version unknownValue constant
- Add isDevMode() helper in config to eliminate repeated "true" comparison
- Simplify buildLocalManagedSecretYAML: remove always-nil error return
  and unused hostname parameter
- Exclude goconst and unparam from test files (test fixtures, mock
  handler signatures)
- Nolint validateInstallOptions networkName — extensibility param

All linters from .golangci.yml are now active with zero issues.

* fix: address PR review feedback on lint config and hook

- Set fix: false in .golangci.yml — lint runs are now non-mutating by
  default, suitable for CI. Use --fix flag explicitly when wanted.
- Hook: use golangci-lint via PATH instead of hardcoded ~/.local/bin/
- Hook: use `golangci-lint fmt "$f"` to scope formatting to the edited
  file only, avoiding unrelated rewrites across the package

---------

Co-authored-by: bussyjd <bussyjd@users.noreply.github.com>

* refactor: move monetize reconciliation into serviceoffer controller

* test: validate PR299 seller buyer and discovery flows

* fix: publish discovery assets from controller

* refactor: enforce singleton registration ownership

* fix: clean merged registration schema

* fix: exclude .workspace from Docker build context

When OBOL_DEVELOPMENT=true, Docker builds from the project root pick up
.workspace/data/ directories that contain root-owned PVC mounts from
previous clusters, causing "permission denied" errors during context
scanning.

Exclude .workspace/ and .worktrees/ from the Docker build context via
.dockerignore.

Fixes #304

* fix: harden facilitator flow and prewarm dev images

* feat: buy-side hardening and openclaw Docker fallback

- Add balance guard to buy.py — exits on insufficient USDC unless --force,
  tells the agent which wallet to fund
- Add end-to-end "Full Buy Flow" section to buy-inference SKILL.md covering
  discover → probe → buy → use paid/<model> → monitor/refill
- Add "After Discovery" cross-reference from discovery SKILL.md to buy-inference
- Add Docker image fallback for openclaw CLI install in obolup.sh for
  npm-deprived environments (ARM64 DGX Spark validated)

Validated full buyer→seller x402 commerce loop on DGX Spark (Base Sepolia):
seller registers on ERC-8004, obol-agent discovers via registry, probes 402,
pre-signs ERC-3009 auths, paid/qwen3.5:9b routes through x402-buyer sidecar.

* fix: use k3d pull-through caches in dev mode

* fix: stageDefaultSkills merges new skills into existing deployments

Previously, stageDefaultSkills skipped entirely if the skills/ directory
already existed. This meant skills added to the binary after initial
deployment (e.g. buy-inference, discovery) were never staged on re-sync,
requiring a manual `obol openclaw skills sync --from` workaround.

Now always runs CopySkills, which writes embedded files without deleting
user-added content. Validated on DGX Spark (ARM64): obol openclaw sync
correctly re-stages all 21 skills into an existing deployment, and the
full x402 commerce loop (probe → buy 5 auths → 5/5 paid inference →
exact depletion on 6th request) passes.

* fix: harden serviceoffer controller reconciliation

* fix: review cleanup — reduce diff noise and fix shadowed builtins

- Rename `copy` → `patched` in controller.go to avoid shadowing the
  built-in copy function (4 occurrences)
- Revert cosmetic whitespace changes in kubectl.go and verifier.go that
  added review noise without functional benefit
- Add ApplyOutput to kubectl.go cleanly atop the original formatting
- Remove SetRegistration/HandleWellKnown from verifier (now handled by
  serviceoffer-controller registration resources)
- Update CLAUDE.md dev constraints to mention serviceoffer-controller image

https://claude.ai/code/session_016ktZi3zLxbnWrjF8cLyTjm

* fix: x402 verifier TLS certs + consumer key for Base Sepolia USDC

Three fixes for cross-machine paid inference via facilitator.x402.rs:

1. x402-verifier CA certificates (x402.yaml):
   The verifier container image lacks a system CA bundle, causing TLS
   failures when connecting to facilitator.x402.rs:
     x509: certificate signed by unknown authority
   Added a ca-certificates ConfigMap (populated at deploy time from the
   host cert store) mounted at /etc/ssl/certs in the verifier pod.

2. Consumer key: Hardhat #9 -> Hardhat #0 (lib.sh, flow-08-buy.sh):
   Base Sepolia USDC (0x036CbD53) rejects ERC-3009
   transferWithAuthorization signatures from Hardhat accounts #1-#9
   (FiatTokenV2: invalid signature). Only HH#0 and custom keys pass
   validation. Root cause unclear but consistently reproducible.

Tested: Real on-chain USDC settlement on Base Sepolia confirmed.
TX: 0xa95cc751a3cd010acde7814719db0f539a53829849c4aba05ebb624009e20a4f
Seller +0.001 USDC, Buyer -0.001 USDC.

* sec: derive test keys from mnemonic at runtime + gitleaks pre-commit

- Replace all hardcoded Hardhat private keys with runtime derivation
  from the well-known test mnemonic via cast wallet derive-private-key
- flow-08-buy.sh reads CONSUMER_PRIVATE_KEY from env (set by lib.sh)
- flow-10 derives FACILITATOR_SIGNER_KEY via hh_key() helper
- Add gitleaks pre-commit hook to block accidental secret commits
- Add .gitleaks.toml allowlist for the public test mnemonic

Zero private key material in source. Keys are derived at runtime
from: test test test test test test test test test test test junk

* chore: extract reth-erc8004-indexer from monetize PR

Remove the standalone Rust reth-erc8004-indexer and its Dockerfile from
this PR. The indexer is not wired into any Helm chart, CLI command, or
infrastructure template — it's dead weight in a +23K line PR.

Tracked separately for integration as its own PR once the wiring into
`obol network install ethereum --indexer` is designed.

* fix: wait for verifier readiness probe after restart in BDD tests

waitForVerifierRestart only checked pod logs for the new facilitator
URL, but the pod could log it before its HTTP listener was ready to
serve requests. This caused transient 500s in the next scenario when
the verifier was still initializing.

Add kubectl rollout status after the restart to block until the pod
passes its /readyz probe (returns 200 only after pricing config is
loaded). The log check remains as a secondary confirmation with a
shorter timeout.

* Review giant monetise pr (#316)

* Remove indexer that's supposedly not used. Will reintroduce as part of obol network install later

* Claudes fixes

* Rename probe to test, commit transcript

* Update cmd/obol/sell.go

Signed-off-by: Oisín Kyne <4981644+OisinKyne@users.noreply.github.com>

---------

Signed-off-by: Oisín Kyne <4981644+OisinKyne@users.noreply.github.com>
Co-authored-by: bussyjd <bussyjd@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Oisín Kyne <4981644+OisinKyne@users.noreply.github.com>
@bussyjd bussyjd mentioned this pull request Apr 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant