docs(migration): bedag/raw → base release ownership transfer script #528

Closed
bussyjd wants to merge 5 commits into refactor/eliminate-bedag-raw-releases from docs/bedag-raw-migration-script

bussyjd commented May 24, 2026

Summary

PR #523 relocates 6 bedag/raw helmfile releases into the base chart so the
stack has one source of truth for what ships in the erpc, obol-frontend,
and llm namespaces. Fresh installs are unaffected. Clusters created before
PR #523 fail at helm upgrade base with invalid ownership metadata because
Helm refuses to adopt resources owned by another release.

This PR ships a one-shot migration script and operator documentation so
existing clusters can be upgraded without hand-fixing ~10 resources.

Symptom

Error: UPGRADE FAILED: <resource> exists and cannot be imported into the
current release: invalid ownership metadata; annotation validation error:
key "meta.helm.sh/release-name" must equal "base"; current value is
"<legacy-release>"
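
When triaging a failed upgrade, it can help to pull the legacy release name out of the error text before deciding what to re-annotate. The following is a minimal sketch, not part of the PR; the function name `legacy_release_from_error` is hypothetical, and it assumes the error text has the shape shown above.

```shell
# Hypothetical helper (not from the PR): extract the legacy release name
# from a Helm "invalid ownership metadata" error read on stdin.
legacy_release_from_error() {
  # Join wrapped lines first, then capture the quoted value that follows
  # 'current value is'. Prints nothing if the pattern is absent.
  tr '\n' ' ' | sed -n 's/.*current value is *"\([^"]*\)".*/\1/p'
}
```

For example, piping the output of a failed `helm upgrade` into `legacy_release_from_error` prints the name of the release that still owns the resource.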

Before / After

BEFORE (pre-#523 cluster after upgrade attempt)
  helmfile.yaml:
    - obol-frontend-rbac     (bedag/raw)  ──┐
    - obol-frontend-httproute (bedag/raw)   │
    - erpc-httproute         (bedag/raw)   ├─ each owns its own resources
    - erpc-x402-middleware   (bedag/raw)   │  with meta.helm.sh/release-name
    - erpc-metadata          (bedag/raw)   │  ≠ "base"
    - llm-buyer-podmonitor   (bedag/raw)  ──┘
  base chart upgrade ─► invalid ownership metadata, abort.

AFTER (script run, then obol stack up)
  All resources annotated and labeled:
    meta.helm.sh/release-name=base
    meta.helm.sh/release-namespace=kube-system
    app.kubernetes.io/managed-by=Helm
  base chart upgrade ─► clean adoption, helm upgrade succeeds.
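
The AFTER state can be reached for a single resource with two kubectl calls. This is a sketch of the idea only; the actual contents of hack/migrate-bedag-raw-to-base.sh are not shown in this PR description, and the function name `adopt_into_base` and the `KUBECTL` override are hypothetical. Note that Helm expects the two `meta.helm.sh/*` keys as annotations and `app.kubernetes.io/managed-by` as a label.

```shell
# Hypothetical sketch: hand one resource over to the "base" release.
# Usage: adopt_into_base KIND NAME [NAMESPACE]
# KUBECTL may be overridden (e.g. KUBECTL=echo) for a dry run.
adopt_into_base() {
  local kind=$1 name=$2 ns=${3:-}
  local kubectl_bin=${KUBECTL:-kubectl}

  # Ownership annotations that Helm validates before adopting a resource.
  "$kubectl_bin" annotate --overwrite "$kind" "$name" ${ns:+-n "$ns"} \
    meta.helm.sh/release-name=base \
    meta.helm.sh/release-namespace=kube-system || return 1

  # The managed-by marker is a label, not an annotation.
  "$kubectl_bin" label --overwrite "$kind" "$name" ${ns:+-n "$ns"} \
    app.kubernetes.io/managed-by=Helm
}
```

Running `KUBECTL=echo adopt_into_base httproute erpc erpc` prints the two commands without touching the cluster. Omitting the namespace argument covers cluster-scoped resources such as `namespace/erpc`.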

When to run

On a cluster created before PR #523, run the script once, then bring the
stack up:

bash hack/migrate-bedag-raw-to-base.sh
obol stack up

Releases handled

Legacy release             Namespace
obol-frontend-rbac         obol-frontend
obol-frontend-httproute    obol-frontend
erpc-httproute             erpc
erpc-x402-middleware       erpc
erpc-metadata              erpc
llm-buyer-podmonitor       llm
x402-verifier-podmonitor   x402 (partial-upgrade clusters from before PR #513)

Plus three resources that may exist with no Helm ownership at all and need
to be adopted into base: namespace/erpc, namespace/obol-frontend,
prometheusrule/x402-verifier.
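
The three cases described here (already owned by base, owned by a legacy release, no Helm ownership at all) map onto a small decision function. The sketch below is an illustration under assumptions, not the script's actual logic; the function name `ownership_action` and its return words are hypothetical.

```shell
# Hypothetical sketch: decide what to do with a resource given the current
# value of its meta.helm.sh/release-name annotation (empty if unset).
ownership_action() {
  case "${1:-}" in
    base) echo "skip" ;;      # already on base, nothing to do
    "")   echo "adopt" ;;     # no Helm ownership yet, adopt into base
    *)    echo "transfer" ;;  # owned by a legacy release, move to base
  esac
}
```

Because a resource already on base always yields `skip`, a second run over the same cluster changes nothing, which is the property that makes a script built this way safe to re-run.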

Files

  • hack/migrate-bedag-raw-to-base.sh — the migration script (executable).
  • docs/upgrade-from-pre-pr-523.md — operator-facing upgrade guide.
  • .github/release-template.md — release-notes entry under
    Breaking changes / Migration notes pointing future release authors at the
    script.

Test plan

  • bash -n hack/migrate-bedag-raw-to-base.sh (syntax check) passes.
  • Script was executed manually during the 14-PR integration test campaign
    against a pre-#523 cluster and confirmed to unblock obol stack up: every
    invalid ownership metadata failure resolved on the next upgrade.
  • Re-run a second time on the same cluster to confirm idempotency
    (every resource reports already on base, skipping).
  • On a fresh cluster, confirm the script is a no-op (no orphan releases
    found) and does not affect obol stack up.
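
One way to check the fresh-cluster case by hand is to ask Helm whether any of the legacy releases listed above exist. This is a sketch that assumes `helm status` is an acceptable existence probe; how the script itself detects orphans is not shown in this PR. The names `LEGACY_RELEASES`, `find_legacy_releases`, and the `HELM` override are hypothetical.

```shell
# Release/namespace pairs taken from the "Releases handled" table.
LEGACY_RELEASES='obol-frontend-rbac obol-frontend
obol-frontend-httproute obol-frontend
erpc-httproute erpc
erpc-x402-middleware erpc
erpc-metadata erpc
llm-buyer-podmonitor llm
x402-verifier-podmonitor x402'

# Hypothetical sketch: print "namespace/release" for every legacy release
# that Helm still knows about. Prints nothing on a fresh cluster.
find_legacy_releases() {
  printf '%s\n' "$LEGACY_RELEASES" | while read -r rel ns; do
    # </dev/null keeps the probe from consuming the loop's input.
    if "${HELM:-helm}" status "$rel" -n "$ns" >/dev/null 2>&1 </dev/null; then
      printf '%s/%s\n' "$ns" "$rel"
    fi
  done
}
```

Empty output from `find_legacy_releases` would correspond to the "no orphan releases found" case in the last bullet.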

Surfaced by the 14-PR integration test campaign; see
plans/integration-test-results-final-20260524.md Bug #2.

HananINouman and others added 5 commits May 22, 2026 22:53
PR #481 only repaired hermes-<id> volumes after hermes.Sync (master agent).
Child agents live under agent-<name> and are provisioned by the controller or
agent-factory without that path, so hermes-data stayed 1000:1000 while Hermes
runs as 10000:10000 and crash-looped on Permission denied under /data/.hermes.

Extend EnsureHermesDataPVCOwnership to agent-<name>/hermes-data, call it from
obol agent new and obol sell demo quant, and add obol agent repair-perms for
factory-only creates that cannot docker-exec the k3d node from in-cluster.

Co-authored-by: Cursor <cursoragent@cursor.com>

Replace host-side Hermes PVC ownership repair with Kubernetes fsGroup and keep only a tiny k3d fallback.

PR #511's host-side chown workaround was superseded by PR #514. This merge records the conflict resolution while keeping main's native Kubernetes fsGroup implementation.

PR #523 moved 6 bedag/raw helmfile releases into the base chart so
there's one source of truth for what ships in each namespace. Fresh
installs work. EXISTING clusters being upgraded from pre-#523
obol-stack fail at `helm upgrade base` with:

  Error: UPGRADE FAILED: <resource> exists and cannot be imported
  into the current release: invalid ownership metadata; annotation
  validation error: key "meta.helm.sh/release-name" must equal "base"

This blocks `obol stack up` until the operator manually re-annotates
~10 resources (Namespaces, HTTPRoutes, Middlewares, ConfigMaps,
PrometheusRule, PodMonitor, ClusterRole/Binding).

Adds hack/migrate-bedag-raw-to-base.sh which finds all such orphans
and re-annotates them in bulk. Idempotent — safe to re-run.

Surfaced by the 14-PR integration test campaign; see
plans/integration-test-results-final-20260524.md Bug #2.
bussyjd changed the base branch from main to refactor/eliminate-bedag-raw-releases May 24, 2026 06:59

bussyjd commented May 24, 2026

Superseded by bundle PR #536 — closing in favor of the consolidated merge target. Original branch and history preserved.
