fix(codex): route ChatGPT OAuth via CODEX_HOME env, bump OpenClaw to 2026.4.25#404
…2026.4.25
The chatgpt_oauth path has been broken since it shipped:
* `_provider_block` wrote `{"openai-codex": {"codexHome": "..."}}` to
openclaw.json. `codexHome` is NOT an OpenClaw config key — it doesn't
exist anywhere in upstream, on any tag or branch. We invented it.
* OpenClaw 2026.4.5 (our pin) had zero pre-staged-auth.json support
anyway. Codex auth in 4.5 only worked via interactive browser OAuth,
which is broken in headless ECS regardless of config shape.
The actual upstream contract (added in v2026.4.7,
extensions/openai/openai-codex-cli-auth.ts):
* Read `${CODEX_HOME}/auth.json` (default `${HOME}/.codex/auth.json`).
* Expected payload: `{"auth_mode":"chatgpt","tokens":{access_token,
refresh_token, account_id?}}` — same shape we already write in
`pre_stage_codex_auth`.
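A minimal Python sketch of the lookup contract described above, for illustration only. The upstream implementation lives in TypeScript; the helper names `resolve_codex_auth_path` and `is_chatgpt_auth_payload` are hypothetical and are not part of OpenClaw or this repo.

```python
from pathlib import Path


def resolve_codex_auth_path(env: dict[str, str]) -> Path:
    """Return ${CODEX_HOME}/auth.json, else ${HOME}/.codex/auth.json."""
    codex_home = env.get("CODEX_HOME")
    if codex_home:
        return Path(codex_home) / "auth.json"
    # Fallback mirrors the documented default location.
    return Path(env.get("HOME", "")) / ".codex" / "auth.json"


def is_chatgpt_auth_payload(payload: dict) -> bool:
    """Check the payload shape from the commit message (account_id is optional)."""
    tokens = payload.get("tokens")
    return (
        payload.get("auth_mode") == "chatgpt"
        and isinstance(tokens, dict)
        and bool(tokens.get("access_token"))
        and bool(tokens.get("refresh_token"))
    )
```

With `CODEX_HOME` unset, the lookup falls back to the home directory, which is why a task definition without the variable cannot see the EFS-staged file.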
Three coordinated changes:
1. `openclaw-version.json` + `apps/infra/openclaw/Dockerfile` →
`alpine/openclaw:2026.4.25` (4.7 minimum required for the feature;
4.25 is current stable with hardened OAuth refresh logic). Per-env
tags reset to `<upstream>-bootstrap` so build-openclaw-image.yml
rebuilds the extended image; the real `<upstream>-<sha>` tag will
land in a follow-up PR (matches the CI workflow's documented flow).
2. `core/containers/config.py` → drop the bogus `codexHome` entry.
Omit the `openai-codex` provider block entirely so the bundled
provider plugin's defaults apply; an empty `{}` would still fail
the base-schema validator (which requires `baseUrl` + `models`).
3. `core/containers/ecs_manager.py` → on the chatgpt_oauth path,
inject `CODEX_HOME=/home/node/.openclaw/codex` as an
`environment:` entry on the per-user task definition. The path is
the in-container view of `<EFS>/users/{user_id}/codex/` after the
per-user EFS access point chroots `/users/{user_id}` to
`/home/node/.openclaw`. `pre_stage_codex_auth` already writes
`auth.json` to that backend-side path, so OpenClaw finds it cold
on first boot.
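A sketch of change 3, assuming ECS-style container definitions where `environment` is a list of `{"name", "value"}` entries. The function `with_codex_home` and its `provider` parameter are hypothetical and only illustrate the merge; the real logic sits in `core/containers/ecs_manager.py`.

```python
import copy

# In-container view of <EFS>/users/{user_id}/codex/ after the access-point chroot.
CODEX_HOME_IN_CONTAINER = "/home/node/.openclaw/codex"


def with_codex_home(container_def: dict, provider: str) -> dict:
    """Return a copy of container_def with CODEX_HOME set on the chatgpt_oauth path."""
    cd_copy = copy.deepcopy(container_def)
    if provider != "chatgpt_oauth":
        return cd_copy
    # Keep existing base entries, dropping any stale CODEX_HOME first.
    base_env = [e for e in (cd_copy.get("environment") or []) if e["name"] != "CODEX_HOME"]
    cd_copy["environment"] = base_env + [
        {"name": "CODEX_HOME", "value": CODEX_HOME_IN_CONTAINER}
    ]
    return cd_copy
```

The input is copied rather than mutated so the base definition can be reused for other users.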
Tests:
* `test_config_provider_routing` updated — chatgpt_oauth must NOT
emit an `openai-codex` provider entry.
* `test_provision_chatgpt_oauth_pre_stages_auth_before_service_create`
extended to assert the per-user task carries the CODEX_HOME env var.
* All 1085 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…table tag (#405)
PR #404 set the pin to alpine/openclaw:2026.4.25, but upstream never published that exact tag: only 2026.4.25-slim exists for the release, and only the 2026.4.25-beta.X series has fat (non-slim) variants. Our Dockerfile extends the base with apt-get layers (ffmpeg, ripgrep, 1password-cli, etc.), so the slim base would break our skill bundling. Beta.11 is the latest 2026.4.25-line tag with the fat variant we need.
Verified via the Docker Hub API:
    curl -sL 'https://hub.docker.com/v2/repositories/alpine/openclaw/tags?name=2026.4.25'
build-openclaw-image.yml run 25031784091 failed with:
    ERROR: docker.io/alpine/openclaw:2026.4.25: not found
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
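A small sketch of the tag check described above, assuming the Docker Hub tags endpoint returns a JSON object whose `results` list holds entries with a `name` field. The helper names and the sample tag list used to try it out are hypothetical.

```python
def has_exact_tag(response: dict, tag: str) -> bool:
    """True if the exact tag name appears in the tag listing."""
    return any(r.get("name") == tag for r in response.get("results", []))


def non_slim_tags(response: dict, line: str) -> list[str]:
    """Return tags on the given release line that are not -slim variants."""
    names = [r["name"] for r in response.get("results", [])]
    return [n for n in names if n.startswith(line) and not n.endswith("-slim")]
```

Running a check like this before bumping the pin would surface a missing tag before the image build does.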
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ecfb9a6d15
    if idx == 0 and environment_for_task:
        # Same merge story for environment — keep CDK base entries
        # (CHOKIDAR_USEPOLLING, CLAWHUB_WORKDIR) intact.
        base_env = list(cd_copy.get("environment") or [])
        cd_copy["environment"] = base_env + list(environment_for_task)
Keep CODEX_HOME when cloning task defs outside provisioning
chatgpt_oauth now depends on CODEX_HOME being present in the task definition, but this env merge only happens when environment_for_task is explicitly passed. Paths like resize_user_container still call _build_register_kwargs_from_base(...) without environment_for_task (see resize_user_container around lines 561-567), so a resize/re-register drops CODEX_HOME from the new revision. After that, OAuth users fall back to ~/.codex/auth.json and lose access to the EFS-staged credentials, so inference fails after a resize/redeploy cycle.
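One possible way to close the gap this comment describes, as a sketch only: when a clone is built without `environment_for_task`, carry selected entries forward from the task definition currently in use. `carry_over_env` and `PRESERVED_ENV_NAMES` are hypothetical names, and this is not necessarily how the repo addresses the issue.

```python
# Env vars that must survive a resize/re-register (assumed list).
PRESERVED_ENV_NAMES = ("CODEX_HOME",)


def carry_over_env(base_env: list[dict], current_env: list[dict],
                   preserved: tuple[str, ...] = PRESERVED_ENV_NAMES) -> list[dict]:
    """Return base_env plus preserved entries that exist only in current_env."""
    base_names = {e["name"] for e in base_env}
    carried = [
        e for e in current_env
        if e["name"] in preserved and e["name"] not in base_names
    ]
    return list(base_env) + carried
```

Only allow-listed names are copied, so unrelated per-user entries from an old revision do not leak into the new one.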
…71624b4 (#409)
Verified in ECR:
    isol8/openclaw-extended:2026.4.25-slim-71624b4
    digest sha256:0409487c9c3b9d2bdcf2f5386357c852aa3041abc7afd40be444e2381ee14e4a
    pushed 2026-04-27 23:42:31 EDT
Built by build-openclaw-image run 25032408047 against main 71624b4 (the gh-via-apt fix from #408 on top of the 4.25-slim switch from #407 on top of the codex-auth env-var fix from #404).
This unblocks the deploy chain: CDK has been failing every cycle since #404 because dev.tag pointed at the placeholder *-bootstrap value. Once this PR merges, deploy.yml will pull the new image and the per-user container task def will reference it on the next provision.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…milies (#410)
The CDK base task def and per-user clones used to register into the same ECS family (`isol8-{env}-openclaw`). #299 had to add an SSM-pinned ARN to work around the resulting "describe-by-family returns a per-user clone" problem. Then today (#404→#409 deploy chain) we hit the OTHER edge of that workaround: the ARN is injected into the backend via `ecs.Secret.fromSsmParameter`, which resolves at TASK STARTUP only and never refreshes. The running backend started with SSM=rev 1011 cached and cloned per-user task defs from that stale revision through every subsequent CDK deploy, producing a per-user task def 1016 that pulled a placeholder image (`2026.4.25-bootstrap`) which doesn't exist in ECR.
Two underlying causes:
1. Co-mingled families forced a workaround.
2. The workaround cached its input at startup.
Fix the FAMILY problem and both downstream issues collapse:
* `EcsManager._build_register_kwargs_from_base` now describes the bare base family (e.g. `isol8-dev-openclaw`). That family contains ONLY CDK base revisions because per-user clones go to `<base>-user`, so ECS returns the latest base revision deterministically.
* Per-user clones register into `f"{base['family']}-user"` so they don't pollute the base family. Existing per-user task defs on the old family stay valid (they're full ARNs); they age out as users re-provision.
* Drop the SSM param + `ECS_TASK_DEFINITION` env var + the `ecs.Secret.fromSsmParameter` wiring + the cached `self._task_def`. Less code, no startup cache to go stale, no incident class.
* Drop the lingering CFN `exportValue` from container-stack (added in #299 to keep the cross-stack import alive across the SSM transition; no consumers now).
Tests updated:
- `test_clones_task_def_with_access_point` asserts describe-by-family AND that the registered family is `<base>-user`.
- `test_resize_reads_env_from_base_not_current` docstring updated to reflect the new mechanism.
- All 1085 unit tests pass.
- 11 pre-existing CDK Jest failures unrelated to this change.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
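A sketch of the family split described in this commit, assuming a boto3-style ECS client whose `describe_task_definition(taskDefinition=<family>)` returns the latest active revision of that family. `per_user_family` and `build_register_kwargs` are hypothetical stand-ins for the real `_build_register_kwargs_from_base`, which carries more fields.

```python
def per_user_family(base_family: str) -> str:
    """Per-user clones register into a sibling family, never the base one."""
    return f"{base_family}-user"


def build_register_kwargs(ecs_client, base_family: str) -> dict:
    """Describe the bare base family and derive kwargs for a per-user clone."""
    # Describing by family name (no revision) resolves to the latest base
    # revision, which is safe only because clones live in a different family.
    base = ecs_client.describe_task_definition(taskDefinition=base_family)["taskDefinition"]
    return {
        "family": per_user_family(base["family"]),
        "containerDefinitions": list(base["containerDefinitions"]),
    }
```

Because the base revision is looked up on every call, nothing is cached at process startup that could go stale across CDK deploys.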
Summary
The `chatgpt_oauth` provisioning path has been broken since it shipped. Two compounding bugs:
1. `_provider_block` wrote `{"openai-codex": {"codexHome": "..."}}` to openclaw.json. `codexHome` is not an OpenClaw config field in any version, on any branch; we invented it. OpenClaw rejected the config and crashed the container. (Verified by reading upstream src + git log -S.)
2. OpenClaw 2026.4.5 (our pin) had no support for a pre-staged `auth.json`; upstream added it in v2026.4.7 (`7e0e2f81e5` added `extensions/openai/openai-codex-cli-auth.ts`).
What this PR does
1. `openclaw-version.json` + `apps/infra/openclaw/Dockerfile` → `alpine/openclaw:2026.4.25` (current stable, includes hardened OAuth refresh on top of the 4.7 base feature). Per-env tags set to `<upstream>-bootstrap` so `build-openclaw-image.yml` rebuilds the extended image.
2. `core/containers/config.py` no longer emits the `openai-codex` provider block. It is omitted entirely, so the bundled provider plugin's defaults apply (an empty `{}` would still fail the base schema's `baseUrl`/`models` validator).
3. `core/containers/ecs_manager.py` injects `CODEX_HOME=/home/node/.openclaw/codex` as an `environment:` entry on the per-user task. The path is the in-container view of `<EFS>/users/{user_id}/codex/` after the access-point chroot, exactly where `pre_stage_codex_auth` already writes `auth.json`.
Verification
Upstream source-grounded confirmation of the fix shape:
Test plan
* `pytest tests/unit`: 1085 pass
* `test_config_provider_routing.test_chatgpt_oauth_branch` updated to assert the `openai-codex` block is absent
* `test_provision_chatgpt_oauth_pre_stages_auth_before_service_create` extended to assert the `CODEX_HOME` env var lands on the per-user task
* Follow-up PR moves the per-env tags from `<upstream>-bootstrap` to `<upstream>-<sha>` once `build-openclaw-image` finishes
Caveats
🤖 Generated with Claude Code