fix(openclaw): roll back to 2026.4.22 fat to escape v4.25 NFS+SQLite hang (#415)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a861721a1a
"dev": { "tag": "2026.4.22-bootstrap" },
"prod": { "tag": "2026.4.22-bootstrap" },
Use the literal bootstrap sentinel for unresolved env tags
Setting dev.tag/prod.tag to 2026.4.22-bootstrap bypasses the fallback logic in apps/infra/lib/stacks/container-stack.ts (it only falls back when envTag === "bootstrap"), so deploys will always try to pull an ECR tag immediately. In the same repo, .github/workflows/build-openclaw-image.yml publishes tags as ${UPSTREAM}-${SHORT_SHA}, so CI never creates 2026.4.22-bootstrap; this leaves dev/prod pointing at a tag that is not produced by automation and causes task start failures when that image is selected.
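The two conditions in this comment can be sketched in a few lines. This is an illustrative stand-in, not the code in container-stack.ts: only the `envTag === "bootstrap"` comparison and the `${UPSTREAM}-${SHORT_SHA}` tag shape come from the review; the function names, the fallback argument, and the assumption that SHORT_SHA is 7 to 40 lowercase hex characters are hypothetical.

```typescript
// Hypothetical stand-in for the tag resolution described in the review.
// Only the exact comparison against "bootstrap" is taken from the source.
function resolveImageTag(envTag: string, fallbackTag: string): string {
  // Exact match only: "2026.4.22-bootstrap" is not the sentinel.
  return envTag === "bootstrap" ? fallbackTag : envTag;
}

// CI publishes `${UPSTREAM}-${SHORT_SHA}`.
// Assumption: SHORT_SHA is 7 to 40 lowercase hex characters.
function looksLikeCiTag(tag: string, upstreamVersion: string): boolean {
  const prefix = `${upstreamVersion}-`;
  if (!tag.startsWith(prefix)) return false;
  return /^[0-9a-f]{7,40}$/.test(tag.slice(prefix.length));
}

const placeholderFallback = "fallback-image"; // placeholder value for illustration

// The literal sentinel triggers the fallback.
console.log(resolveImageTag("bootstrap", placeholderFallback));
// The prefixed value is passed through as a real tag to pull.
console.log(resolveImageTag("2026.4.22-bootstrap", placeholderFallback));
// false: "bootstrap" is not a hex SHA, so CI would never publish this tag.
console.log(looksLikeCiTag("2026.4.22-bootstrap", "2026.4.22"));
// true: matches the `${UPSTREAM}-${SHORT_SHA}` shape.
console.log(looksLikeCiTag("2026.4.22-a861721", "2026.4.22"));
```

Under these assumptions, `2026.4.22-bootstrap` fails both ways: it skips the fallback and it names a tag automation cannot produce.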
@@ -1,11 +1,11 @@
 {
   "$schema": "./openclaw-version.schema.json",
-  "upstream": "alpine/openclaw:2026.4.25-slim",
+  "upstream": "alpine/openclaw:2026.4.22",
Keep upstream version pin aligned with Dockerfile FROM
This changes the declared upstream pin to alpine/openclaw:2026.4.22, but apps/infra/openclaw/Dockerfile still bases the extended image on alpine/openclaw:2026.4.25-slim (FROM line 63). Because build-openclaw-image.yml derives the pushed tag name from openclaw-version.json.upstream, newly built extended images will be labeled as 2026.4.22-* while still containing the 2026.4.25-slim base, so the rollback is not actually applied on the extended-image path.
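A drift check along these lines would catch the mismatch described above. It is a minimal sketch: the function names are hypothetical, the Dockerfile excerpt is invented for illustration, and the parser only handles plain `FROM [--flag] image [AS name]` lines. It flags FROM lines that use the same repository as the pinned upstream but a different tag.

```typescript
// Returns the image reference of every FROM line in a Dockerfile.
// Handles `FROM [--platform=...] image [AS name]`; anything more exotic is out of scope.
function fromImages(dockerfile: string): string[] {
  const images: string[] = [];
  for (const raw of dockerfile.split("\n")) {
    const tokens = raw.trim().split(/\s+/);
    if (tokens[0]?.toUpperCase() !== "FROM") continue;
    // Skip flags such as --platform=...; the first non-flag token is the image.
    const image = tokens.slice(1).find((t) => !t.startsWith("--"));
    if (image !== undefined) images.push(image);
  }
  return images;
}

// Lists FROM images that share the upstream's repository but not its exact reference.
function driftedFromLines(dockerfile: string, upstream: string): string[] {
  const repo = upstream.split(":")[0];
  return fromImages(dockerfile).filter(
    (image) => image.split(":")[0] === repo && image !== upstream,
  );
}

// Hypothetical excerpt; the real Dockerfile is not reproduced here.
const exampleDockerfile = [
  "# extended image",
  "FROM alpine/openclaw:2026.4.25-slim AS extended",
  "RUN echo placeholder",
].join("\n");

// Logs the one drifted reference: alpine/openclaw:2026.4.25-slim
console.log(driftedFromLines(exampleDockerfile, "alpine/openclaw:2026.4.22"));
```

Run in CI against the real files, a non-empty result would be a reason to fail the build before a mislabeled image is pushed.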
v2026.4.25-slim wedges every container start: gateway main thread enters
uninterruptible NFS RPC wait (rpc_wait_bit_killable) on
~/.openclaw/tasks/runs.sqlite via OpenClaw's loopback-NFS layer
(127.0.0.1:21005). Matches upstream issue #73517 ("Gateway task registry
maintenance can hot-loop on stale runs.sqlite"), reproduced against
2026.4.25 (aa36ee6). v2026.4.26 partially fixes the WAL growth side
(#72774) but introduces an unfixed acpx EPERM regression on remote FS
(#73333), so we can't move forward — only back.
2026.4.22 fat predates #73517, has CODEX_HOME (added 4.7) so ChatGPT
OAuth still works, and bundles all plugin runtime deps in-image so
first boot doesn't pay the 90s slim install penalty. We previously
ran on 4.22 fat in PR #406 without this hang.
Schema-compliance changes (zod-schema.agent-defaults.ts at v2026.4.22
requires these three fields, no .optional()):
- agents.defaults.embeddedHarness: {} (line 42)
- agents.defaults.contextLimits: {} (line 115)
- agents.defaults.heartbeat: {} (line 251)
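The three required keys listed above can be checked before deploy with a small pre-flight helper. This is a plain-TypeScript sketch, not the zod schema itself: the key names come from the commit message, while the helper names and the "absent means undefined" rule are assumptions.

```typescript
// The three keys the commit message says v2026.4.22 requires under agents.defaults.
const REQUIRED_DEFAULTS = ["embeddedHarness", "contextLimits", "heartbeat"] as const;

type AgentDefaults = Record<string, unknown>;

// Names of required keys that are absent from the given defaults object.
function missingRequiredDefaults(defaults: AgentDefaults): string[] {
  return REQUIRED_DEFAULTS.filter((key) => defaults[key] === undefined);
}

// Returns a copy with `{}` filled in for each absent key; present values are kept.
function withRequiredDefaults(defaults: AgentDefaults): AgentDefaults {
  const patched: AgentDefaults = { ...defaults };
  for (const key of missingRequiredDefaults(defaults)) patched[key] = {};
  return patched;
}

// A config written for a version where these fields were not needed.
const before: AgentDefaults = { heartbeat: {} };
console.log(missingRequiredDefaults(before)); // embeddedHarness and contextLimits
console.log(missingRequiredDefaults(withRequiredDefaults(before))); // empty list
```

Filling with `{}` mirrors what this commit does by hand; whether the schema accepts other shapes for these keys is not covered here.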
Also reverts the channel-disable defensive patch from #413: the no-account
enabled:true channel-plugin behavior was a v4.25 sidecar bug, not a 4.22
issue. Channels are back to enabled:true so first-pair stays a fast
hot-reload instead of a 6-min full gateway restart on Fargate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from a861721 to ef64631.
… actual rollback (#418)

PR #415 + #416 only updated openclaw-version.json but the extended Dockerfile keeps a hardcoded `FROM alpine/openclaw:2026.4.25-slim` (line 63). The build workflow tags the resulting image as 2026.4.22-<sha>, but it was always built FROM 4.25-slim. So every "rollback test" was still on the broken 4.25-slim base, with all of #73517's NFS+SQLite hang behavior intact. Same hang every time, just labeled differently.

The Dockerfile's own header comment (lines 12-13) literally says: "Bump UPSTREAM = openclaw-version.json#upstream field. Keep the FROM lines below in sync with that field manually until automation lands." The manual sync was missed.

Fix: change line 63 to FROM alpine/openclaw:2026.4.22 (the fat upstream tag, which exists on Docker Hub, last_updated 2026-04-23, 1.2 GB).

Once this PR merges, build-openclaw-image.yml will produce a new ECR tag 2026.4.22-<sha> built FROM the actual 4.22 base. A follow-up bump of openclaw-version.json#dev.tag/prod.tag will deploy it.

Includes the openclaw-version.json from origin/main HEAD (b590890) so the diff is purely the Dockerfile FROM line.

TODO follow-up: parameterize the FROM line via build-arg so this can't drift from openclaw-version.json again.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
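One way the build-arg TODO could look from the build script's side is sketched below. It assumes the Dockerfile would declare `ARG UPSTREAM` ahead of a `FROM ${UPSTREAM}` line; the ARG name, the function name, and the validation pattern are all hypothetical, and only the `upstream` field of openclaw-version.json is modelled.

```typescript
// Only the field used here; other fields of openclaw-version.json are omitted.
interface OpenclawVersion {
  upstream: string;
}

// Builds `docker build` arguments that pass the pinned upstream into the Dockerfile.
// Assumes the Dockerfile declares `ARG UPSTREAM` before `FROM ${UPSTREAM}`;
// the ARG name is hypothetical.
function dockerBuildArgs(version: OpenclawVersion, contextDir: string): string[] {
  // Reject anything that is not a simple `repo:tag` reference.
  if (!/^[\w.\-/]+:[\w.\-]+$/.test(version.upstream)) {
    throw new Error(`unexpected upstream image reference: ${version.upstream}`);
  }
  return ["build", "--build-arg", `UPSTREAM=${version.upstream}`, contextDir];
}

const pinned: OpenclawVersion = { upstream: "alpine/openclaw:2026.4.22" };
// Logs: docker build --build-arg UPSTREAM=alpine/openclaw:2026.4.22 .
console.log(["docker", ...dockerBuildArgs(pinned, ".")].join(" "));
```

With the base image supplied this way, the tag name and the FROM line would both be derived from the same field, so the label and the contents could not diverge.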
Summary
- Roll back `openclaw-version.json` to `alpine/openclaw:2026.4.22` (fat variant). Drops `-slim`.
- Add the three `agents.defaults` fields (`embeddedHarness`, `contextLimits`, `heartbeat`) required by the v2026.4.22 zod schema.
- Re-enable channels (`telegram`, `discord`, `slack` → `enabled: true`); the disable was a v4.25-specific defensive patch.

Why
v2026.4.25-slim deterministically wedges every container start. From a live ECS-exec probe on the running gateway: the main thread is stuck in an uninterruptible NFS RPC wait (`rpc_wait_bit_killable`) on `~/.openclaw/tasks/runs.sqlite`, behind the loopback-NFS layer at 127.0.0.1:21005.
Matches upstream issue #73517, a task-registry hot-loop on stale `runs.sqlite` (reported against the same commit `aa36ee6`). The loopback NFS server inside the openclaw container deadlocks the gateway's main JS thread.
Forward path is blocked too: v2026.4.26 has an unfixed acpx-EPERM regression on remote filesystems (#73333, fix PR #73341 closed but not merged).
So the only safe move is back. We previously ran on 2026.4.22 fat in #406 with no hang. The fat variant has all bundled plugin runtime deps prebaked, so there is no 90s install penalty on first boot. It also has `CODEX_HOME` (added 4.7), so ChatGPT OAuth works.

Test plan
- … `2026.4.22` upstream
- … `[gateway] ready` and stay healthy past `starting channels and sidecars...` (the wedge point)
- … `[ws] closed before connect code=1006`)

Risk
`2026.4.22-bootstrap` won't resolve until the extended-image CI workflow runs once and pushes a per-commit tag. First deploy after merge will fail; subsequent deploys after the image build are fine.

🤖 Generated with Claude Code