
fix(openclaw): disable channels at scaffold — v2026.4.25 sidecar hang#413

Merged
prez2307 merged 1 commit into main from fix/disable-channel-plugins on Apr 29, 2026
@prez2307
Contributor

Summary

OpenClaw v2026.4.25's startGatewaySidecars() hangs forever when channel plugins (telegram/discord/slack) are loaded with enabled: true but no configured accounts.

Symptom

Container logs:

[gateway] starting channels and sidecars...

The gateway never gets past this line: no "sidecars ready" message appears for the entire 30+ minute container lifetime. The HTTP request handler awaits getReadiness(), which gates on onSidecarsReady(), and that callback fires only after startGatewaySidecars() resolves. As a result, every WebSocket upgrade and HTTP request hangs.

User-visible: the backend logs "RPC health failed: timed out during opening handshake", and the frontend shows "Starting your container" indefinitely.

Root cause walk

  1. Layer 1–3 (frontend → API GW → backend): all working
  2. Layer 4 (backend → openclaw container): TCP connects, but HTTP/WS upgrade gets no response
  3. Inside the container: the Node main process is alive (CPU ~12%, logs still flushing), but the HTTP request handler hangs
  4. OpenClaw source trace: getReadiness() returns {ready: false} until startupSidecarsReady = true. That flag is set in the onSidecarsReady callback, which fires only after startGatewaySidecars() returns. With channels enabled but no accounts configured, that promise never resolves in 4.25.
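
The gating in step 4 can be modelled in a few lines. This is a minimal Python asyncio sketch of the pattern, not OpenClaw's actual code (which is TypeScript). The `Gateway` class, the `probe` helper, and the wait time are illustrative assumptions; only the idea of a readiness flag that flips after sidecar startup resolves comes from the trace above.

```python
import asyncio


class Gateway:
    """Toy model: the readiness flag flips only after sidecar startup resolves."""

    def __init__(self, start_sidecars):
        self._start_sidecars = start_sidecars  # hypothetical coroutine function
        self.startup_sidecars_ready = False

    def get_readiness(self):
        return {"ready": self.startup_sidecars_ready}

    async def boot(self):
        await self._start_sidecars()
        # Reached only if sidecar startup resolves (the onSidecarsReady moment).
        self.startup_sidecars_ready = True


async def healthy_sidecars():
    await asyncio.sleep(0)


async def hung_sidecars():
    await asyncio.Event().wait()  # the event is never set, so this never resolves


async def probe(start_sidecars, wait_s=0.05):
    """Boot a gateway, wait briefly, and report what a readiness check would see."""
    gw = Gateway(start_sidecars)
    task = asyncio.create_task(gw.boot())
    await asyncio.sleep(wait_s)
    readiness = gw.get_readiness()
    task.cancel()  # clean up the boot task, whether finished or hung
    try:
        await task
    except asyncio.CancelledError:
        pass
    return readiness


if __name__ == "__main__":
    print(asyncio.run(probe(healthy_sidecars)))
    print(asyncio.run(probe(hung_sidecars)))
```

In this model, anything that awaits readiness before serving a request inherits the hang of the startup promise, which matches the behavior seen at layer 4.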

We were shipping channels with enabled: true to support a hot-reload path tied to the old tier-upgrade flow. That no longer applies post-flat-fee: when a user pairs a channel, channel_link_service patches the config to turn on that specific provider. Starting from disabled and toggling per provider is the correct flow.
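
A rough sketch of the intended flow, assuming the config nests each provider under `channels.<provider>.enabled`. That layout and the function names `scaffold_channels` and `link_channel` are hypothetical; the real scaffold and channel_link_service may differ, and real pairing presumably also writes account details.

```python
import copy

PROVIDERS = ("telegram", "discord", "slack")


def scaffold_channels():
    """Initial config: every channel plugin present but disabled."""
    return {"channels": {name: {"enabled": False} for name in PROVIDERS}}


def link_channel(config, provider):
    """Return a copy of config with only the paired provider switched on."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    patched = copy.deepcopy(config)  # leave the caller's config untouched
    patched["channels"][provider]["enabled"] = True
    return patched


if __name__ == "__main__":
    cfg = scaffold_channels()
    print(link_channel(cfg, "telegram"))
```

Returning a patched copy keeps the scaffold default intact, so a provider is only ever enabled as the result of an explicit pairing step.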

Test plan

  • pytest tests/unit/containers/test_config_provider_routing.py — 8 pass
  • Watch deploy + per-user container reaches "sidecars ready" + backend gateway pool connects
  • User-facing: container flips to running in DDB
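
For the "watch deploy" step, a small helper along these lines could classify container logs. It is a sketch only: the start marker is taken from the log line in the Symptom section, while the exact wording of the ready line is an assumption (matched here by the substring "sidecars ready"), and `sidecar_state` is a hypothetical name.

```python
START_MARKER = "starting channels and sidecars"
READY_MARKER = "sidecars ready"  # assumed substring of the ready log line


def sidecar_state(log_lines):
    """Classify gateway startup from container log lines.

    Returns "not-started", "starting" (start seen, no ready line yet), or "ready".
    A later start marker (for example after a restart) resets the state to "starting".
    """
    state = "not-started"
    for line in log_lines:
        if START_MARKER in line:
            state = "starting"
        elif READY_MARKER in line and state == "starting":
            state = "ready"
    return state


if __name__ == "__main__":
    print(sidecar_state(["[gateway] starting channels and sidecars..."]))
```

A container that stays in "starting" well past the normal boot time is showing the hang described in the Symptom section.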

Note

If after this the sidecar still hangs, the next suspects are phone-control or talk-voice (new auto-loaded plugins in 4.25). Both can be disabled via plugins.deny in a follow-up. Did not include those here so we can isolate which plugin was actually responsible.
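
If that follow-up is needed, the patch might look roughly like this. It assumes plugins.deny is a list of plugin names, which this PR does not confirm, and `deny_plugins` is a hypothetical helper.

```python
def deny_plugins(config, names):
    """Return a copy of config with names appended to plugins.deny, skipping duplicates."""
    plugins = dict(config.get("plugins", {}))
    deny = list(plugins.get("deny", []))
    for name in names:
        if name not in deny:
            deny.append(name)
    plugins["deny"] = deny
    return {**config, "plugins": plugins}


if __name__ == "__main__":
    print(deny_plugins({}, ["phone-control", "talk-voice"]))
```

Denying one plugin at a time would keep the isolation goal stated above.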

🤖 Generated with Claude Code

OpenClaw v2026.4.25's startGatewaySidecars() hangs forever when channel
plugins (telegram/discord/slack) are loaded with enabled=true but no
configured accounts. We were shipping all three enabled at scaffold for
a hot-reload path that no longer applies post-flat-fee.

Symptom: container logs `[gateway] starting channels and sidecars...`
but never completes. HTTP request handler awaits getReadiness() which
gates on onSidecarsReady() — never fires → every WS upgrade and HTTP
request hangs until the client times out.

Each channel is re-enabled per-provider via channel_link_service when
the user actually pairs one. Hot-reload works fine when starting from
disabled — the original "ship them all enabled to avoid full-gateway-
restart cost" reasoning was a tier-upgrade optimization that's dead
post-flat-fee anyway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prez2307 merged commit 3743d2c into main on Apr 29, 2026
1 check failed
prez2307 deleted the fix/disable-channel-plugins branch on April 29, 2026 01:28
prez2307 added a commit that referenced this pull request Apr 29, 2026
v2026.4.25-slim wedges every container start: gateway main thread enters
uninterruptible NFS RPC wait (rpc_wait_bit_killable) on
~/.openclaw/tasks/runs.sqlite via OpenClaw's loopback-NFS layer
(127.0.0.1:21005). Matches upstream issue #73517 ("Gateway task registry
maintenance can hot-loop on stale runs.sqlite"), reproduced against
2026.4.25 (aa36ee6). v2026.4.26 partially fixes the WAL growth side
(#72774) but introduces an unfixed acpx EPERM regression on remote FS
(#73333), so we can't move forward — only back.

2026.4.22 fat predates #73517, has CODEX_HOME (added 4.7) so ChatGPT
OAuth still works, and bundles all plugin runtime deps in-image so
first boot doesn't pay the 90s slim install penalty. We previously
ran on 4.22 fat in PR #406 without this hang.

Schema-compliance changes (zod-schema.agent-defaults.ts at v2026.4.22
requires these three fields, no .optional()):
- agents.defaults.embeddedHarness: {} (line 42)
- agents.defaults.contextLimits: {}  (line 115)
- agents.defaults.heartbeat: {}      (line 251)

Also reverts the channel-disable defensive patch from #413: the no-account
enabled:true channel-plugin behavior was a v4.25 sidecar bug, not a 4.22
issue. Channels are back to enabled:true so first-pair stays a fast
hot-reload instead of a 6-min full gateway restart on Fargate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
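
The schema-compliance change in the commit message above could be sketched as follows. The three key names come from that message; the helper name `ensure_agent_defaults` and the choice to preserve any existing values are assumptions, not the actual diff.

```python
REQUIRED_AGENT_DEFAULTS = ("embeddedHarness", "contextLimits", "heartbeat")


def ensure_agent_defaults(config):
    """Add empty objects for the required agents.defaults keys.

    Mutates and returns config; keys that already have a value are left alone.
    """
    agents = config.setdefault("agents", {})
    defaults = agents.setdefault("defaults", {})
    for key in REQUIRED_AGENT_DEFAULTS:
        defaults.setdefault(key, {})
    return config


if __name__ == "__main__":
    print(ensure_agent_defaults({}))
```

Using setdefault keeps the helper idempotent, so running it against an already compliant config changes nothing.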
prez2307 added a commit that referenced this pull request Apr 29, 2026
…ang (#415)
