fix(tunnel): add --overwrite-dns to obol tunnel login/setup #471

Merged
bussyjd merged 1 commit into main from fix/tunnel-overwrite-dns
May 12, 2026

Conversation

Collaborator

@bussyjd bussyjd commented May 12, 2026

Summary

Re-running obol tunnel login (or obol tunnel setup --management local) against a hostname that already has a DNS record fails with Cloudflare API error 1003:

cloudflared tunnel route dns failed: exit status 1
Failed to add route: code: 1003, reason: Failed to create record
inference.example.com with err An A, AAAA, or CNAME record with that
host already exists.

cloudflared has supported --overwrite-dns on tunnel route dns for years (also TUNNEL_FORCE_PROVISIONING_DNS) — the obol wrapper just never passed it.

This adds the --overwrite-dns flag to both obol tunnel login and obol tunnel setup, plumbs it through LoginOptions/SetupOptions, and includes a hint on the existing error path so operators learn about the flag at the moment they need it.

Default is off — replacing an existing record is destructive (it could flatten a CNAME placed by someone else), so opt-in is the right shape. Remote-managed mode is unaffected because the Cloudflare API path already upserts.

Operators hit this when

  • They retry the wizard after fixing an earlier issue (e.g. wrong CF account on first try).
  • They move an existing hostname onto a fresh in-cluster tunnel (cluster recreated, new petname, same DNS target).
  • Surfaced today on spark2 while re-pointing inference.v1337.org from a stale (sacred-magpie) tunnel to a fresh dev-mode (merry-troll) tunnel.

Test plan

  • go test ./internal/tunnel -run TestRouteDNSArgs -count=1 — new helper passes both default and overwrite cases.
  • go build ./... clean.
  • Manual on spark2: obol tunnel login --hostname inference.v1337.org --overwrite-dns replaces the existing CNAME and finishes the wizard cleanly.

Full report (PR template) added as the first comment.

Re-running `obol tunnel login` or `obol tunnel setup --management local`
against a hostname that already has an A/AAAA/CNAME record fails with:

  cloudflared tunnel route dns failed: exit status 1
  Failed to add route: code: 1003, reason: Failed to create record
  inference.example.com with err An A, AAAA, or CNAME record with that
  host already exists.

cloudflared has supported `--overwrite-dns` on `tunnel route dns` for
years for exactly this case (also exposed as `TUNNEL_FORCE_PROVISIONING_DNS`)
but the obol wrapper never passed it.

Operators hit this whenever they:
- Retry a wizard run after fixing an earlier failure.
- Move an existing hostname over to a new in-cluster tunnel (cluster
  re-created, new petname, same DNS target intended).

Changes:
- internal/tunnel/login.go: add `OverwriteDNS bool` to LoginOptions and
  pass `--overwrite-dns` through a new `routeDNSArgs` helper. When the
  caller did not opt in and cloudflared returns the "record already
  exists" error, append a hint pointing at --overwrite-dns.
- internal/tunnel/domain_setup.go: add the same field to SetupOptions
  and forward it into the Login call in local-managed mode. Remote
  mode is unaffected because the Cloudflare API path already upserts.
- cmd/obol/tunnel_domain.go: expose `--overwrite-dns` on
  `obol tunnel login` and `obol tunnel setup`.
- Default stays false — the user has to opt in. The flag is genuinely
  destructive (you may flatten a CNAME that someone else placed), so
  surfacing it as a hint when the conflict happens is the right
  ergonomic shape.
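
A rough sketch of how the flag could travel from the command line into the login call. `LoginOptions`, `SetupOptions` and `OverwriteDNS` are named in this commit; every other field, the `parseSetupFlags` and `loginOptionsFor` helpers, and the use of the standard `flag` package are assumptions made for illustration (the obol CLI may use a different flag library).

```go
package main

import (
	"flag"
	"fmt"
)

// LoginOptions and SetupOptions mirror the types named in the commit.
// Only OverwriteDNS is confirmed; the other fields are hypothetical.
type LoginOptions struct {
	Hostname     string
	OverwriteDNS bool
}

type SetupOptions struct {
	Hostname     string
	Management   string // e.g. "local" or "remote"
	OverwriteDNS bool
}

// loginOptionsFor (hypothetical) forwards the setup options into the
// login call used by local-managed mode.
func loginOptionsFor(s SetupOptions) LoginOptions {
	return LoginOptions{Hostname: s.Hostname, OverwriteDNS: s.OverwriteDNS}
}

// parseSetupFlags (hypothetical) parses the flags with the stdlib flag
// package. The overwrite flag defaults to false, so it is opt-in.
func parseSetupFlags(args []string) (SetupOptions, error) {
	var opts SetupOptions
	fs := flag.NewFlagSet("setup", flag.ContinueOnError)
	fs.StringVar(&opts.Hostname, "hostname", "", "public hostname to route")
	fs.StringVar(&opts.Management, "management", "", "tunnel management mode")
	fs.BoolVar(&opts.OverwriteDNS, "overwrite-dns", false, "replace an existing DNS record for the hostname")
	if err := fs.Parse(args); err != nil {
		return SetupOptions{}, err
	}
	return opts, nil
}

func main() {
	opts, err := parseSetupFlags([]string{"--hostname", "inference.example.com", "--management", "local", "--overwrite-dns"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", loginOptionsFor(opts))
}
```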

Test plan:
- `go test ./internal/tunnel -run TestRouteDNSArgs` passes both
  default and overwrite cases.
- Manual: confirmed end-to-end on spark2 — the prior CNAME for
  `inference.v1337.org` was replaced when the wizard was re-run after
  the cluster was recreated.
Collaborator Author

bussyjd commented May 12, 2026

Summary

What changed: internal/tunnel/login.go now appends --overwrite-dns to the cloudflared tunnel route dns invocation when the new OverwriteDNS option is set; LoginOptions and SetupOptions carry it; obol tunnel login and obol tunnel setup expose --overwrite-dns. When cloudflared returns "record already exists" without the flag, the error now also prints a one-line hint pointing at it.

Why it matters: obol tunnel login / obol tunnel setup --management local are guaranteed to fail on the second run for the same hostname because Cloudflare DNS rejects record collisions (API code 1003). Operators have to drop to raw cloudflared --origincert … tunnel route dns --overwrite-dns <UUID> <host> to recover, which requires reading internal CLI source to find the tunnel UUID. We hit this twice today on spark2.

Risk level: low

Commit under test: tip of fix/tunnel-overwrite-dns (1b25806)

Base branch: main

Scope

  • Code
  • Charts / manifests
  • Flows / QA scripts
  • Docs / skills
  • Images / dependencies
  • Other:

Validation

CI checks:

| Check | Status | Link |
| --- | --- | --- |
| GitHub Actions CI | pending | (set by PR open) |

Unit tests:

$ go test ./internal/tunnel -run TestRouteDNSArgs -count=1
=== RUN   TestRouteDNSArgs
=== RUN   TestRouteDNSArgs/default_(no_overwrite)
=== RUN   TestRouteDNSArgs/overwrite-dns_inserted_before_tunnel/hostname
--- PASS: TestRouteDNSArgs (0.00s)
PASS
ok  	github.com/ObolNetwork/obol-stack/internal/tunnel	1.107s

Integration tests:

n/a — wizard is not yet covered by an integration test on the obol side.

Flow tests:

| Flow | Network | QA machine label | Worktree | Result | Artifacts |
| --- | --- | --- | --- | --- | --- |
| (none: flag is opt-in, default behaviour unchanged) | | | | | |

Release smoke:

n/a — no release artefact change.

Live Chain Evidence

n/a — this PR is CLI plumbing; no on-chain side effects.

Runtime Evidence

QA environment:

| Item | Value |
| --- | --- |
| OS / arch | linux/arm64 (Ubuntu 24.04.4 LTS on NVIDIA GB10) |
| Backend | k3d, dev mode (OBOL_DEVELOPMENT=true) |
| Tool versions | obol@dev (go run wrapper, source at fix/tunnel-overwrite-dns), cloudflared 2026.3.0, k3d 5.8.3 |
| QA agent/model | n/a |

Images: n/a

Kubernetes / stack:

| Item | Value |
| --- | --- |
| Stack IDs | merry-troll |
| Namespaces | traefik (cloudflared chart) |
| Pod readiness | cloudflared 1/1 Running after re-run with --overwrite-dns |
| Cleanup result | tunnel UUID a7729815-… retained; CNAME for the test hostname re-pointed at it |

Model and routing: n/a

Artifacts and logs:

| Artifact | Location / link | Notes |
| --- | --- | --- |
| Before output | reproduced in PR body | API code 1003 |
| After output | reproduced in PR body | wizard reaches Tunnel ready |

Demo readiness:

| Item | Status | Notes |
| --- | --- | --- |
| Seller visible / registered | n/a | PR is plumbing |
| Buyer discovery works | n/a | |
| Paid route works | n/a | |
| Settlement visible on-chain | n/a | |

Review Notes

Known gaps:

  • No automated end-to-end test of the wizard — TestRouteDNSArgs only covers argument construction. A live cloudflared test would require Cloudflare credentials in CI.

Follow-ups:

  • Apply the same -y/--yes aliasing pattern to other destructive commands (obol stack purge, obol agent delete, etc.). Filed separately: #472 (fix(sell): accept --yes / -y as aliases for --force on sell delete) starts the scope with sell delete.
  • Consider auto-detecting "the existing record already points at our tunnel" and retrying with overwrite silently for that idempotent case. Out of scope here because it needs an API call against the zone to read the existing record.

Reviewer focus:

  • Is "opt-in --overwrite-dns" the right shape, or should we always pass it? Argument against always: stomping a CNAME that someone else placed.
  • The error-path hint string in login.go. Acceptable phrasing?

@bussyjd bussyjd merged commit 163bb91 into main May 12, 2026
6 checks passed
@bussyjd bussyjd deleted the fix/tunnel-overwrite-dns branch May 12, 2026 09:06