fix(openclaw): roll back to 2026.4.22 fat — escape v4.25 NFS+SQLite hang#415

Merged
prez2307 merged 1 commit into main from fix/disable-channel-plugins
Apr 29, 2026

@prez2307
Contributor

Summary

  • Pin openclaw-version.json to alpine/openclaw:2026.4.22 (fat variant). Drops -slim.
  • Add three schema-required fields to agents.defaults (embeddedHarness, contextLimits, heartbeat) — required by v2026.4.22 zod schema.
  • Re-enable channel plugins (telegram, discord, slack → enabled: true); the disable was a v4.25-specific defensive patch.
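
As a rough illustration of the config changes listed above, the sketch below builds a config object and checks that the three required `agents.defaults` keys are present. Only the key names `embeddedHarness`, `contextLimits`, `heartbeat` and the `enabled: true` channel flags come from this PR; the `OpenclawConfigSketch` type, the top-level `channels` key, and the `missingDefaults` helper are hypothetical and do not reflect the real zod schema.

```typescript
// Hypothetical config shape: only the three agents.defaults keys and the
// enabled: true channel flags are taken from this PR. Everything else is assumed.
type Json = Record<string, unknown>;

interface OpenclawConfigSketch {
  agents: { defaults: Json };
  channels: Record<string, { enabled: boolean }>;
}

// Keys the PR says the v2026.4.22 schema requires (no .optional()).
const REQUIRED_DEFAULTS = ["embeddedHarness", "contextLimits", "heartbeat"] as const;

const config: OpenclawConfigSketch = {
  agents: {
    defaults: {
      embeddedHarness: {},
      contextLimits: {},
      heartbeat: {},
    },
  },
  channels: {
    telegram: { enabled: true },
    discord: { enabled: true },
    slack: { enabled: true },
  },
};

// Returns the required keys that are absent from agents.defaults.
function missingDefaults(cfg: OpenclawConfigSketch): string[] {
  return REQUIRED_DEFAULTS.filter((key) => !(key in cfg.agents.defaults));
}
```

A check like `missingDefaults(config)` could run before deploy to catch a schema rejection early, though the real validation is whatever the upstream zod schema enforces.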

Why

v2026.4.25-slim deterministically wedges every container start. From a live ECS-exec probe on the running gateway:

PID 52 (openclaw-gateway):
  State: D (uninterruptible disk sleep)
  wchan: rpc_wait_bit_killable
  fds:   /home/node/.openclaw/tasks/runs.sqlite{,-wal,-shm}
  mount: 127.0.0.1:/  /home/node/.openclaw  nfs4  hard,port=21005

This matches upstream issue #73517 (task-registry hot-loop on stale runs.sqlite), reported against the same commit, aa36ee6. The loopback NFS server inside the openclaw container deadlocks the gateway's main JS thread.
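
The probe above could be turned into a repeatable check. The following is a minimal sketch, assuming the caller has already read the text of `/proc/<pid>/status` and `/proc/<pid>/wchan` (for example over ECS exec); the `classifyProbe` function and `ProbeResult` type are hypothetical names, not part of OpenClaw or this repo.

```typescript
// Sketch: classify a process as wedged on an NFS RPC wait from /proc text.
// `status` is the content of /proc/<pid>/status, `wchan` of /proc/<pid>/wchan.
interface ProbeResult {
  state: string | null; // single-letter process state, e.g. "D"
  wchan: string;        // kernel symbol the task is sleeping in
  wedgedOnNfs: boolean;
}

function classifyProbe(status: string, wchan: string): ProbeResult {
  // The status file has a line like "State:\tD (disk sleep)".
  const match = /^State:\s+(\S)/m.exec(status);
  const state = match?.[1] ?? null;
  const symbol = wchan.trim();
  // "D" is uninterruptible sleep; rpc_wait_bit_killable is the symbol
  // reported in the probe output above.
  const wedgedOnNfs = state === "D" && symbol.startsWith("rpc_wait_bit");
  return { state, wchan: symbol, wedgedOnNfs };
}
```

A single sample in state D is not proof of a deadlock; sampling a few times over several seconds would be a more convincing signal.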

Forward path is blocked too: v2026.4.26 has an unfixed acpx-EPERM regression on remote filesystems (#73333, fix PR #73341 closed but not merged).

So the only safe move is back. We previously ran on 2026.4.22 fat in #406 with no hang. The fat variant has all bundled plugin runtime deps prebaked → no 90s install penalty on first boot. It also has CODEX_HOME (added in 4.7), so ChatGPT OAuth works.

Test plan

  • CI builds extended image off 2026.4.22 upstream
  • Dev redeploys cleanly
  • Provision a fresh container, watch CloudWatch logs reach [gateway] ready and stay healthy past starting channels and sidecars... (the wedge point)
  • Backend gateway connection pool establishes WS handshake (no more [ws] closed before connect code=1006)
  • ChatGPT-OAuth signup path completes end-to-end

Risk

  • New tag 2026.4.22-bootstrap won't resolve until the extended-image CI workflow runs once and pushes a per-commit tag. First deploy after merge will fail; subsequent deploys after the image build are fine.

🤖 Generated with Claude Code

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a861721a1a

Comment thread openclaw-version.json
Comment on lines +8 to +9
"dev": { "tag": "2026.4.22-bootstrap" },
"prod": { "tag": "2026.4.22-bootstrap" },

[P1] Use the literal bootstrap sentinel for unresolved env tags

Setting dev.tag/prod.tag to 2026.4.22-bootstrap bypasses the fallback logic in apps/infra/lib/stacks/container-stack.ts (it only falls back when envTag === "bootstrap"), so deploys will always try to pull an ECR tag immediately. In the same repo, .github/workflows/build-openclaw-image.yml publishes tags as ${UPSTREAM}-${SHORT_SHA}, so CI never creates 2026.4.22-bootstrap; this leaves dev/prod pointing at a tag that is not produced by automation and causes task start failures when that image is selected.
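
The mismatch described above can be sketched as follows. Only the exact-match check `envTag === "bootstrap"` and the `${UPSTREAM}-${SHORT_SHA}` tag format come from the review comment; the function names, the `ImageSource` type, and the assumption that the fallback means "use the upstream image directly" are hypothetical and may differ from what container-stack.ts actually does.

```typescript
// Sketch of the tag resolution described in the review comment.
// Names and the fallback target are assumptions, not the repo's real code.
const BOOTSTRAP = "bootstrap";

// Tag format the build workflow is said to publish: upstream version + short SHA.
function ciTag(upstreamVersion: string, shortSha: string): string {
  return `${upstreamVersion}-${shortSha}`;
}

type ImageSource =
  | { kind: "fallback"; image: string }
  | { kind: "ecr"; tag: string };

function resolveImage(envTag: string, upstreamImage: string): ImageSource {
  // Exact match only: "2026.4.22-bootstrap" is NOT the sentinel,
  // so it is treated as a real ECR tag that must already exist.
  if (envTag === BOOTSTRAP) {
    return { kind: "fallback", image: upstreamImage };
  }
  return { kind: "ecr", tag: envTag };
}
```

Under these assumptions, `resolveImage("2026.4.22-bootstrap", …)` takes the ECR branch, while CI only ever produces tags shaped like `ciTag("2026.4.22", "dbf5da0")`, which is why the placeholder never resolves.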

Comment thread openclaw-version.json
@@ -1,11 +1,11 @@
 {
   "$schema": "./openclaw-version.schema.json",
-  "upstream": "alpine/openclaw:2026.4.25-slim",
+  "upstream": "alpine/openclaw:2026.4.22",

[P1] Keep upstream version pin aligned with Dockerfile FROM

This changes the declared upstream pin to alpine/openclaw:2026.4.22, but apps/infra/openclaw/Dockerfile still bases the extended image on alpine/openclaw:2026.4.25-slim (FROM line 63). Because build-openclaw-image.yml derives the pushed tag name from openclaw-version.json.upstream, newly built extended images will be labeled as 2026.4.22-* while still containing the 2026.4.25-slim base, so the rollback is not actually applied on the extended-image path.
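
A drift check along the lines of this comment might look like the sketch below. It takes the Dockerfile text and the `upstream` value as plain strings and reports any `FROM` image from the same repository whose tag differs from the pin. The helper names `fromImages`, `repository`, and `driftedFromLines` are hypothetical, and digest-pinned references (`repo@sha256:…`) are not handled.

```typescript
// Sketch: detect FROM lines that disagree with openclaw-version.json#upstream.
// Helper names are hypothetical; digest-pinned images are out of scope.

// Collects the image reference from every FROM line, skipping flags
// such as --platform=... and ignoring any trailing "AS <stage>".
function fromImages(dockerfile: string): string[] {
  const images: string[] = [];
  for (const line of dockerfile.split(/\r?\n/)) {
    const m = /^\s*FROM\s+(?:--\S+\s+)*(\S+)/i.exec(line);
    if (m && m[1]) images.push(m[1]);
  }
  return images;
}

// Strips the tag, e.g. "alpine/openclaw:2026.4.22" -> "alpine/openclaw".
function repository(image: string): string {
  const colon = image.lastIndexOf(":");
  return colon > image.lastIndexOf("/") ? image.slice(0, colon) : image;
}

// Returns FROM images that use the upstream repository but a different tag.
function driftedFromLines(dockerfile: string, upstream: string): string[] {
  return fromImages(dockerfile).filter(
    (image) => repository(image) === repository(upstream) && image !== upstream,
  );
}
```

Comparing only images from the same repository keeps unrelated build stages (for example a `FROM node:…` helper stage) from being flagged; a CI step could fail the build whenever the returned list is non-empty.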

v2026.4.25-slim wedges every container start: gateway main thread enters
uninterruptible NFS RPC wait (rpc_wait_bit_killable) on
~/.openclaw/tasks/runs.sqlite via OpenClaw's loopback-NFS layer
(127.0.0.1:21005). Matches upstream issue #73517 ("Gateway task registry
maintenance can hot-loop on stale runs.sqlite"), reproduced against
2026.4.25 (aa36ee6). v2026.4.26 partially fixes the WAL growth side
(#72774) but introduces an unfixed acpx EPERM regression on remote FS
(#73333), so we can't move forward — only back.

2026.4.22 fat predates #73517, has CODEX_HOME (added 4.7) so ChatGPT
OAuth still works, and bundles all plugin runtime deps in-image so
first boot doesn't pay the 90s slim install penalty. We previously
ran on 4.22 fat in PR #406 without this hang.

Schema-compliance changes (zod-schema.agent-defaults.ts at v2026.4.22
requires these three fields, no .optional()):
- agents.defaults.embeddedHarness: {} (line 42)
- agents.defaults.contextLimits: {}  (line 115)
- agents.defaults.heartbeat: {}      (line 251)

Also reverts the channel-disable defensive patch from #413: the no-account
enabled:true channel-plugin behavior was a v4.25 sidecar bug, not a 4.22
issue. Channels are back to enabled:true so first-pair stays a fast
hot-reload instead of a 6-min full gateway restart on Fargate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@prez2307 prez2307 force-pushed the fix/disable-channel-plugins branch from a861721 to ef64631 Compare April 29, 2026 04:21
@prez2307 prez2307 merged commit dbf5da0 into main Apr 29, 2026
1 check passed
prez2307 added a commit that referenced this pull request Apr 29, 2026
#416)

Build workflow on PR #415 pushed 877352799272.dkr.ecr.us-east-1.amazonaws.com/isol8/openclaw-extended:2026.4.22-dbf5da0 to ECR (digest sha256:b75fd3a0). Switching dev+prod off the unresolvable -bootstrap placeholder.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prez2307 added a commit that referenced this pull request Apr 29, 2026
… actual rollback (#418)

PR #415 + #416 only updated openclaw-version.json but the extended Dockerfile
keeps a hardcoded `FROM alpine/openclaw:2026.4.25-slim` (line 63). The build
workflow tags the resulting image as 2026.4.22-<sha> — but it was always built
FROM 4.25-slim. So every "rollback test" was still on the broken 4.25-slim
base, with all of #73517's NFS+SQLite hang behavior intact. Same hang every
time, just labeled differently.

The Dockerfile's own header comment (lines 12-13) literally says: "Bump
UPSTREAM = openclaw-version.json#upstream field. Keep the FROM lines below in
sync with that field manually until automation lands." — the manual sync was
missed.

Fix: change line 63 to FROM alpine/openclaw:2026.4.22 (the fat upstream tag,
which exists on Docker Hub, last_updated 2026-04-23, 1.2 GB).

Once this PR merges, build-openclaw-image.yml will produce a new ECR tag
2026.4.22-<sha> built FROM the actual 4.22 base. A follow-up bump of
openclaw-version.json#dev.tag/prod.tag will deploy it.

Includes the openclaw-version.json from origin/main HEAD (b590890) so the
diff is purely the Dockerfile FROM line.

TODO follow-up: parameterize the FROM line via build-arg so this can't drift
from openclaw-version.json again.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
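
One way the TODO above could be approached is to derive the `docker build` arguments from openclaw-version.json, so the base image and the pushed tag come from the same field. The sketch below is an assumption-laden illustration: it presumes the Dockerfile declares `ARG UPSTREAM` before a `FROM ${UPSTREAM}` line, and the build-arg name `UPSTREAM`, the `dockerBuildArgs` function, and the tag derivation are hypothetical rather than what build-openclaw-image.yml does today.

```typescript
// Sketch: build docker CLI arguments from openclaw-version.json so that
// FROM and the image tag cannot drift apart. All names here are assumptions.
interface VersionFile {
  upstream: string; // e.g. "alpine/openclaw:2026.4.22"
}

function dockerBuildArgs(versionJson: string, shortSha: string, repo: string): string[] {
  const { upstream } = JSON.parse(versionJson) as VersionFile;
  // Everything after the last ":" is treated as the upstream version tag.
  const version = upstream.slice(upstream.lastIndexOf(":") + 1);
  return [
    "build",
    "--build-arg", `UPSTREAM=${upstream}`,      // consumed by ARG UPSTREAM / FROM ${UPSTREAM}
    "--tag", `${repo}:${version}-${shortSha}`,  // same <upstream>-<sha> shape the workflow publishes
    ".",
  ];
}
```

For example, `dockerBuildArgs('{"upstream":"alpine/openclaw:2026.4.22"}', "dbf5da0", "isol8/openclaw-extended")` yields a tag of `isol8/openclaw-extended:2026.4.22-dbf5da0` alongside a matching `UPSTREAM` build argument.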