bug(v6.0.2): shared RELAY_API_KEY across machines — broker exits at WebSocket subscription with 'Unable to connect' #797

@prefrontalsys

Description

Summary

In v6.0.2, the documented "set the same RELAY_API_KEY on multiple machines so brokers join the same workspace" pattern (which #789 builds on as a precondition) silently fails. The HTTP registration step succeeds — the cloud accepts the key and registers the agent — but the WebSocket subscription step that follows exits with "Unable to connect. Is the computer able to access the url?", killing the broker.

This blocks every multi-machine use case, including the messaging premise that #789 assumes works.

Environment

  • agent-relay: v6.0.2 (standalone binary install via install.sh)
  • Mac: macOS arm64, agent_name sc, launchd-managed broker, ran agent-relay cloud login (token in ~/.agent-relay/cloud-auth.json)
  • Linux VPS: Ubuntu aarch64, agent_name ubuntu, systemd-user managed broker, same cloud-auth.json copied via scp from Mac
  • agent-relay cloud whoami returns identical user/org/workspace on both: sr4001@gmail.com, Scot Campbell's Workspace, Default

Reproduction

  1. On Mac, agent-relay up --no-dashboard — broker starts, creates workspace, prints Workspace Key: rk_live_4165c2f458fc2976c0dd0ad092050afb
  2. Verify the key is live from VPS:
    curl -H "Authorization: Bearer rk_live_4165c2f..." https://api.relaycast.dev/v1/channels
    → 200 OK, channels listed (`general`, `engineering`)
    
  3. On VPS, set the env var and start:
    RELAY_API_KEY=rk_live_4165c2f... agent-relay up --no-dashboard
    
  4. Observed: `Broker started.` is logged, then the broker exits within ~1s with exit status 1 and this on stderr: `Failed to start broker: Unable to connect. Is the computer able to access the url?`
  5. Cross-check on the cloud during step 4 (via api.relaycast.dev /v1/agents):
    • ubuntu agent does appear in the list, status offline, with a released_at timestamp matching the broker exit
    • i.e., the agent registration HTTP call succeeded — the failure is downstream

Why I think this is the WS-subscription step

  • VPS network is fine (HTTP 200 to api.relaycast.dev/v1/channels)
  • Bearer auth works (the same key returns 200 from VPS via curl)
  • The broker logs `Broker started.` (after `connect_relay` completed) before the error appears
  • The "Unable to connect" message is identical to what surfaces from tokio_tungstenite failures elsewhere in the binary's strings (tokio_tungstenite::tls::encryption::rustls)
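One way to pin this down would be to drive the WebSocket upgrade handshake by hand from the VPS and read the server's status line: a 4xx response with the same key that returns 200 over plain HTTP would confirm the WS step (and its auth) as the culprit, while a TCP/TLS failure would point back at the network. This is only a sketch; the `/v1/subscribe` path and exact header set are guesses, not documented relaycast API:

```python
# Raw WS-handshake probe: separates "can't reach host" from "upgrade rejected".
# The path and header set below are assumptions, not the documented API.
import base64, os, socket, ssl

def build_upgrade_request(host: str, path: str, bearer: str) -> bytes:
    """Build a minimal RFC 6455 client upgrade request with a Bearer header."""
    key = base64.b64encode(os.urandom(16)).decode()
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        f"Authorization: Bearer {bearer}",
        "", "",  # blank line terminates the header block
    ]
    return "\r\n".join(lines).encode()

def probe(host: str, path: str, bearer: str) -> str:
    """Return the server's HTTP status line; 101 means the upgrade was accepted."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, 443), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as s:
            s.sendall(build_upgrade_request(host, path, bearer))
            return s.recv(1024).split(b"\r\n", 1)[0].decode()

if __name__ == "__main__":
    # Hypothetical usage, assuming the endpoint path:
    # print(probe("api.relaycast.dev", "/v1/subscribe", "rk_live_4165c2f..."))
    pass
```

If this returns something like `HTTP/1.1 403 Forbidden` while the curl in step 2 returns 200, that would directly demonstrate the HTTP/WS auth split described above.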

What works (control)

Without RELAY_API_KEY, both brokers run fine (each auto-creates its own workspace). Each can talk to its own agents on its own host. The bug is only triggered when a broker is told to join a workspace it didn't create.

Hypothesis on root cause

The WebSocket subscription requires an auth context tied to the creator of the workspace — perhaps the cloud token used at workspace creation gets bound to subsequent WS connections. A broker that joins via RELAY_API_KEY only has its own cloud token, which the WS endpoint won't accept for that workspace's channel.

If correct, the fix is either (a) accept any cloud-auth token from the same org for WS auth on any workspace owned by that org, or (b) provide a workspace.export / workspace.invite flow that issues a transferable WS-eligible credential.
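To make the hypothesis concrete, here is a toy model of the two auth policies. Every name in it is invented for illustration; this is not the relaycast server code, just the shape of the check I suspect is happening versus fix (a):

```python
# Toy model of the hypothesized WS-auth check. All identifiers are invented;
# they illustrate the suspected policy, not the actual server implementation.
from dataclasses import dataclass

@dataclass(frozen=True)
class Workspace:
    key: str            # the shared rk_live_... workspace key
    creator_token: str  # cloud token of whoever created the workspace
    org: str

@dataclass(frozen=True)
class CloudToken:
    value: str
    org: str

def ws_auth_current(ws: Workspace, token: CloudToken) -> bool:
    """Suspected v6.0.2 behavior: only the creator's cloud token is accepted."""
    return token.value == ws.creator_token

def ws_auth_proposed(ws: Workspace, token: CloudToken) -> bool:
    """Proposed fix (a): any cloud token from the workspace's org is accepted."""
    return token.org == ws.org

ws = Workspace(key="rk_live_xxx", creator_token="tok_mac", org="org_scot")
mac = CloudToken("tok_mac", "org_scot")
vps = CloudToken("tok_vps", "org_scot")  # same org, different machine

# Under the suspected policy the VPS broker is rejected even though its
# HTTP Bearer auth (the rk_live_ key) succeeded; under fix (a) it is admitted.
assert ws_auth_current(ws, mac) and not ws_auth_current(ws, vps)
assert ws_auth_proposed(ws, vps)
```

This also explains the control case below: a broker that created its own workspace always holds the creator token, so the check never fails single-machine.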

Why this matters for #789

#789 ("remote spawn") explicitly states: "multiple brokers sharing a workspace key can exchange messages, DMs, and channel posts in real-time." That assumption is currently false in v6.0.2 — sharing a workspace key crashes the broker. Spawning across machines presupposes the messaging fabric works first; this bug must be fixed (or worked around) before #789 becomes meaningful.

Asks

  1. Confirm whether multi-machine RELAY_API_KEY sharing is an intended supported configuration in v6 or a regression from v5.
  2. If supported: surface the actual WS error (current "Unable to connect" is misleading — HTTP auth is fine).
  3. If not yet supported: document this clearly in the README and add a workspace export / workspace join <key> flow that produces the right credential bundle.
  4. Consider whether RELAY_WORKSPACES_JSON / RELAY_DEFAULT_WORKSPACE (visible in binary strings, undocumented) is the intended path here.
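For ask 3, a `workspace export` bundle might only need a few fields. Purely as a hypothetical illustration (none of these field names exist in the product today, as far as I can tell):

```json
{
  "workspace_key": "rk_live_...",
  "ws_credential": "<WS-eligible token scoped to this workspace>",
  "org": "<org id>",
  "expires_at": "<timestamp>"
}
```

A `workspace join <key>` on the second machine would then store this bundle instead of relying on that machine's own cloud token for the WS step.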

Logs / evidence available

Happy to attach:

  • journalctl --user -u agent-relay from VPS during the failing window
  • ~/Library/Logs/agent-relay.{out,err}.log from Mac (success case)
  • ~/.agent-relay/identity-debug.txt from both machines showing distinct agent_id / default_workspace values that converge when RELAY_API_KEY is shared (proving HTTP-side resolution works)
  • api.relaycast.dev/v1/agents listing from before/after the VPS attempt

Metadata

Labels: bug (Something isn't working)