fix(codex): route ChatGPT OAuth via CODEX_HOME env, bump OpenClaw to 2026.4.25 #404

Merged
prez2307 merged 1 commit into main from fix/codex-auth-via-env-var
Apr 28, 2026

@prez2307
Contributor

Summary

The chatgpt_oauth provisioning path has been broken since it shipped. Two compounding bugs:

  1. We wrote {"openai-codex": {"codexHome": "..."}} to openclaw.json. codexHome is not an OpenClaw config field in any version, on any branch — we invented it. OpenClaw rejected the config and crashed the container. (Verified by reading upstream src + git log -S.)
  2. Even with valid config, our pinned v2026.4.5 has no support for a pre-staged auth.json at all: Codex auth in 4.5 was interactive-browser-only, which cannot work in headless ECS. Reading a pre-staged auth.json first landed in v2026.4.7 (commit 7e0e2f81e5 added extensions/openai/openai-codex-cli-auth.ts).

What this PR does

  • OpenClaw bump: openclaw-version.json + apps/infra/openclaw/Dockerfile → alpine/openclaw:2026.4.25 (current stable, includes hardened OAuth refresh on top of the 4.7 base feature). Per-env tags set to <upstream>-bootstrap so build-openclaw-image.yml rebuilds the extended image.
  • Drop fake config knob: core/containers/config.py no longer emits the openai-codex provider block. Omitted entirely — the bundled provider plugin's defaults apply (and an empty {} would still fail the base schema's baseUrl/models validator).
  • Use the real one: core/containers/ecs_manager.py injects CODEX_HOME=/home/node/.openclaw/codex as an environment: entry on the per-user task. The path is the in-container view of <EFS>/users/{user_id}/codex/ after the access-point chroot — exactly where pre_stage_codex_auth already writes auth.json.
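
As a rough illustration of the last bullet, the sketch below builds the extra `environment` entry in the name/value shape that ECS container definitions use. The helper name `codex_home_env` and the `auth_mode` argument are hypothetical; they are not taken from `ecs_manager.py`.

```python
# Hypothetical sketch: helper and parameter names are assumptions,
# not the actual code in core/containers/ecs_manager.py.
CODEX_HOME_IN_CONTAINER = "/home/node/.openclaw/codex"


def codex_home_env(auth_mode: str) -> list[dict[str, str]]:
    """Return extra ECS environment entries for the given auth mode."""
    if auth_mode != "chatgpt_oauth":
        # Other auth modes need no Codex-specific environment.
        return []
    # ECS container definitions express environment as name/value pairs.
    return [{"name": "CODEX_HOME", "value": CODEX_HOME_IN_CONTAINER}]


if __name__ == "__main__":
    print(codex_home_env("chatgpt_oauth"))
```

Keeping the value in one constant means the provisioning code and its tests can refer to the same path.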

Verification

Upstream source-grounded confirmation of the fix shape:

// extensions/openai/openai-codex-cli-auth.ts (HEAD)
function resolveCodexCliHome(env) {
  const configured = trimNonEmptyString(env.CODEX_HOME);
  if (!configured) return path.join(resolveRequiredHomeDir(), ".codex");
  return path.resolve(configured);
}
function readCodexCliAuthFile(env) {
  const authPath = path.join(resolveCodexCliHome(env), "auth.json");
  ...
}
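
For backend-side reasoning about where OpenClaw will look, the upstream lookup can be mirrored in Python. This is only a sketch: it assumes `trimNonEmptyString` strips whitespace and treats an empty result as unset, and the function names here are made up for illustration.

```python
import os
from pathlib import Path


def resolve_codex_home(env: dict[str, str]) -> Path:
    """Mirror the upstream rule: CODEX_HOME if set, otherwise ~/.codex."""
    # Assumption: blank or whitespace-only values count as "not configured".
    configured = (env.get("CODEX_HOME") or "").strip()
    if not configured:
        return Path.home() / ".codex"
    # abspath normalises the path without touching the filesystem.
    return Path(os.path.abspath(configured))


def codex_auth_path(env: dict[str, str]) -> Path:
    """Location of auth.json for the given environment mapping."""
    return resolve_codex_home(env) / "auth.json"


if __name__ == "__main__":
    print(codex_auth_path({"CODEX_HOME": "/home/node/.openclaw/codex"}))
    print(codex_auth_path({}))
```

The second call shows the fallback that applies whenever the variable is missing from the task definition.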

Test plan

  • pytest tests/unit — 1085 pass
  • test_config_provider_routing.test_chatgpt_oauth_branch updated to assert the openai-codex block is absent
  • test_provision_chatgpt_oauth_pre_stages_auth_before_service_create extended to assert CODEX_HOME env var lands on the per-user task
  • Watch deploy + smoke-test on dev with a fresh ChatGPT OAuth flow
  • Follow-up PR to bump per-env tags from <upstream>-bootstrap → <upstream>-<sha> once build-openclaw-image finishes

Caveats

  • The bootstrap tag intentionally won't resolve. CDK deploy on this PR's main-merge will fail to pull the image until the follow-up tag-bump PR lands. That's the project's documented two-PR flow for OpenClaw image bumps; not a regression.
  • Once the new image is in ECR, existing dev containers need to be re-provisioned to pick up the new openclaw.json shape and CODEX_HOME env. We'll do that as part of the dev clean test.
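
The two tag shapes in the two-PR flow can be sketched as plain string helpers. The function names and the seven-character short SHA are assumptions for illustration; the real workflow may derive the suffix differently.

```python
# Hypothetical helpers; names and the 7-character short SHA are assumptions.
BOOTSTRAP_SUFFIX = "-bootstrap"


def bootstrap_tag(upstream: str) -> str:
    """Placeholder tag that signals the image still has to be built."""
    return f"{upstream}{BOOTSTRAP_SUFFIX}"


def built_tag(upstream: str, commit_sha: str) -> str:
    """Tag for an image built from a specific commit."""
    return f"{upstream}-{commit_sha[:7]}"


def is_placeholder(tag: str) -> bool:
    """True when the tag is the bootstrap placeholder and will not resolve."""
    return tag.endswith(BOOTSTRAP_SUFFIX)


if __name__ == "__main__":
    tag = bootstrap_tag("2026.4.25")
    print(tag, is_placeholder(tag))
    print(built_tag("2026.4.25", "0123456789abcdef"))
```

A deploy-time guard could call something like `is_placeholder` to fail fast with a clear message instead of an image-pull error.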

🤖 Generated with Claude Code

…2026.4.25

The chatgpt_oauth path has been broken since it shipped:

* `_provider_block` wrote `{"openai-codex": {"codexHome": "..."}}` to
  openclaw.json. `codexHome` is NOT an OpenClaw config key — it doesn't
  exist anywhere in upstream, on any tag or branch. We invented it.
* OpenClaw 2026.4.5 (our pin) had zero pre-staged-auth.json support
  anyway. Codex auth in 4.5 only worked via interactive browser OAuth,
  which is broken in headless ECS regardless of config shape.

The actual upstream contract (added in v2026.4.7,
extensions/openai/openai-codex-cli-auth.ts):

* Read `${CODEX_HOME}/auth.json` (default `${HOME}/.codex/auth.json`).
* Expected payload: `{"auth_mode":"chatgpt","tokens":{access_token,
  refresh_token, account_id?}}` — same shape we already write in
  `pre_stage_codex_auth`.
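
A minimal sketch of producing that payload, assuming the shape quoted above. The helper names are hypothetical and do not reproduce `pre_stage_codex_auth`; the token values are dummies.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def build_codex_auth_payload(
    access_token: str, refresh_token: str, account_id: Optional[str] = None
) -> dict:
    """Build the auth.json body; account_id is optional in the contract."""
    tokens = {"access_token": access_token, "refresh_token": refresh_token}
    if account_id is not None:
        tokens["account_id"] = account_id
    return {"auth_mode": "chatgpt", "tokens": tokens}


def write_auth_json(codex_home: Path, payload: dict) -> Path:
    """Write auth.json under codex_home, creating the directory if needed."""
    codex_home.mkdir(parents=True, exist_ok=True)
    auth_path = codex_home / "auth.json"
    auth_path.write_text(json.dumps(payload))
    return auth_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = write_auth_json(
            Path(tmp) / "codex", build_codex_auth_payload("at-dummy", "rt-dummy")
        )
        print(json.loads(path.read_text()))
```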

Three coordinated changes:

1. `openclaw-version.json` + `apps/infra/openclaw/Dockerfile` →
   `alpine/openclaw:2026.4.25` (4.7 minimum required for the feature;
   4.25 is current stable with hardened OAuth refresh logic). Per-env
   tags reset to `<upstream>-bootstrap` so build-openclaw-image.yml
   rebuilds the extended image; the real `<upstream>-<sha>` tag will
   land in a follow-up PR (matches the CI workflow's documented flow).

2. `core/containers/config.py` → drop the bogus `codexHome` entry.
   Omit the `openai-codex` provider block entirely so the bundled
   provider plugin's defaults apply; an empty `{}` would still fail
   the base-schema validator (which requires `baseUrl` + `models`).

3. `core/containers/ecs_manager.py` → on the chatgpt_oauth path,
   inject `CODEX_HOME=/home/node/.openclaw/codex` as an
   `environment:` entry on the per-user task definition. The path is
   the in-container view of `<EFS>/users/{user_id}/codex/` after the
   per-user EFS access point chroots `/users/{user_id}` to
   `/home/node/.openclaw`. `pre_stage_codex_auth` already writes
   `auth.json` to that backend-side path, so OpenClaw finds it cold
   on first boot.
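
The path translation in item 3 can be sketched as follows, assuming the access point roots `/users/{user_id}` at `/home/node/.openclaw`. The function name is hypothetical.

```python
from pathlib import PurePosixPath

# Mount point of the per-user access point inside the container.
CONTAINER_ROOT = PurePosixPath("/home/node/.openclaw")


def in_container_path(efs_path: str, user_id: str) -> PurePosixPath:
    """Map a path relative to the EFS root to its in-container view.

    Raises ValueError when the path lies outside the user's access point.
    """
    access_point_root = PurePosixPath("/users") / user_id
    absolute = PurePosixPath("/" + efs_path.lstrip("/"))
    return CONTAINER_ROOT / absolute.relative_to(access_point_root)


if __name__ == "__main__":
    print(in_container_path("/users/u123/codex/auth.json", "u123"))
```

With the illustrative user id `u123`, the backend-side `users/u123/codex/` directory maps to the same path that `CODEX_HOME` points at.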

Tests:
* `test_config_provider_routing` updated — chatgpt_oauth must NOT
  emit an `openai-codex` provider entry.
* `test_provision_chatgpt_oauth_pre_stages_auth_before_service_create`
  extended to assert the per-user task carries the CODEX_HOME env var.
* All 1085 unit tests pass.
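
The first assertion could look roughly like the sketch below. `build_provider_blocks` is a stand-in written for this example, not the real `_provider_block`, and the non-OAuth branch returns placeholder data.

```python
# Stand-in for the config builder; the real function lives in
# core/containers/config.py and is not reproduced here.
def build_provider_blocks(auth_mode: str) -> dict:
    if auth_mode == "chatgpt_oauth":
        # Emit nothing so the bundled provider plugin's defaults apply.
        return {}
    # Placeholder entry carrying the two keys the base schema requires.
    return {"example-provider": {"baseUrl": "https://example.invalid/v1", "models": []}}


def test_chatgpt_oauth_emits_no_openai_codex_block():
    providers = build_provider_blocks("chatgpt_oauth")
    assert "openai-codex" not in providers


if __name__ == "__main__":
    test_chatgpt_oauth_emits_no_openai_codex_block()
    print("ok")
```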

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@prez2307 prez2307 merged commit 9d4388f into main Apr 28, 2026
1 check passed
@prez2307 prez2307 deleted the fix/codex-auth-via-env-var branch April 28, 2026 03:10
prez2307 added a commit that referenced this pull request Apr 28, 2026
…table tag (#405)

PR #404 set the pin to alpine/openclaw:2026.4.25, but upstream never
published that exact tag — only 2026.4.25-slim and the 2026.4.25-beta.X
series have published fat (non-slim) variants. Our Dockerfile extends
with apt-get layers (ffmpeg, ripgrep, 1password-cli, etc.) so the slim
base would break our skill bundling. Beta.11 is the latest 2026.4.25-line
tag with the fat variant we need.

Verified via Docker Hub API:
  curl -sL 'https://hub.docker.com/v2/repositories/alpine/openclaw/tags?name=2026.4.25'

build-openclaw-image.yml run 25031784091 failed with:
  ERROR: docker.io/alpine/openclaw:2026.4.25: not found
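
The same check can be scripted before changing a pin. This sketch assumes the Docker Hub response lists tags under `results`, each with a `name` field, and reads a single page only; the sample payload is illustrative, not real API output.

```python
import json
import urllib.parse
import urllib.request

TAGS_URL = "https://hub.docker.com/v2/repositories/alpine/openclaw/tags"


def tag_names(payload: dict) -> list[str]:
    """Extract tag names; assumes a 'results' list of objects with 'name'."""
    return [entry["name"] for entry in payload.get("results", [])]


def fetch_tag_names(name_filter: str) -> list[str]:
    """Fetch one page of matching tag names (requires network access)."""
    url = f"{TAGS_URL}?{urllib.parse.urlencode({'name': name_filter})}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return tag_names(json.load(resp))


def has_exact_tag(names: list[str], tag: str) -> bool:
    """True only for an exact match, not for a -slim or -beta variant."""
    return tag in names


if __name__ == "__main__":
    # Illustrative payload standing in for a live API response.
    sample = {"results": [{"name": "2026.4.25-slim"}, {"name": "2026.4.25-beta.11"}]}
    print(has_exact_tag(tag_names(sample), "2026.4.25"))
```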

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ecfb9a6d15

Comment on lines +198 to +202
if idx == 0 and environment_for_task:
    # Same merge story for environment — keep CDK base entries
    # (CHOKIDAR_USEPOLLING, CLAWHUB_WORKDIR) intact.
    base_env = list(cd_copy.get("environment") or [])
    cd_copy["environment"] = base_env + list(environment_for_task)

P1: Keep CODEX_HOME when cloning task defs outside provisioning

chatgpt_oauth now depends on CODEX_HOME being present in the task definition, but this env merge only happens when environment_for_task is explicitly passed. Paths like resize_user_container still call _build_register_kwargs_from_base(...) without environment_for_task (see resize_user_container around lines 561-567), so a resize/re-register drops CODEX_HOME from the new revision. After that, OAuth users fall back to ~/.codex/auth.json and lose access to the EFS-staged credentials, so inference fails after a resize/redeploy cycle.
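
One possible way to address this, sketched under assumptions: when re-registering from the base, carry selected variables over from the revision currently in use. The helper names and the sample values are hypothetical; this is not the fix the repository adopted.

```python
# Hypothetical helpers for ECS-style environment lists (name/value dicts).
def carry_forward(current_env, names=("CODEX_HOME",)):
    """Pick the entries from the current revision that must survive a clone."""
    return [entry for entry in (current_env or []) if entry["name"] in names]


def merge_environment(base_env, extra_env):
    """Merge two lists; later entries override earlier ones with the same name."""
    merged = {}
    for entry in list(base_env or []) + list(extra_env or []):
        merged[entry["name"]] = entry["value"]
    return [{"name": name, "value": value} for name, value in merged.items()]


if __name__ == "__main__":
    # Illustrative values only.
    base = [{"name": "CHOKIDAR_USEPOLLING", "value": "true"}]
    current = base + [{"name": "CODEX_HOME", "value": "/home/node/.openclaw/codex"}]
    print(merge_environment(base, carry_forward(current)))
```

Merging by name also avoids duplicate entries if the same variable is passed on both sides.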

prez2307 added a commit that referenced this pull request Apr 28, 2026
…71624b4 (#409)

Verified in ECR:
  isol8/openclaw-extended:2026.4.25-slim-71624b4
  digest sha256:0409487c9c3b9d2bdcf2f5386357c852aa3041abc7afd40be444e2381ee14e4a
  pushed 2026-04-27 23:42:31 EDT

Built by build-openclaw-image run 25032408047 against main 71624b4
(the gh-via-apt fix from #408 on top of the 4.25-slim switch from #407
on top of the codex-auth env-var fix from #404).

This unblocks the deploy chain — CDK has been failing every cycle since
#404 because dev.tag pointed at the placeholder *-bootstrap value. Once
this PR merges, deploy.yml will pull the new image and the per-user
container task def will reference it on the next provision.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prez2307 added a commit that referenced this pull request Apr 28, 2026
…milies (#410)

The CDK base task def and per-user clones used to register into the same
ECS family (`isol8-{env}-openclaw`). #299 had to add an SSM-pinned ARN to
work around the resulting "describe-by-family returns a per-user clone"
problem. Then today (#404 → #409 deploy chain) we hit the OTHER edge of
that workaround: the ARN is injected into the backend via
`ecs.Secret.fromSsmParameter`, which resolves at TASK STARTUP only —
never refreshes. The running backend started with SSM=rev 1011 cached
and cloned per-user task defs from that stale revision through every
subsequent CDK deploy, producing a per-user task def 1016 that pulled
a placeholder image (`2026.4.25-bootstrap`) which doesn't exist in ECR.

Two underlying causes:
  1. Co-mingled families forced a workaround.
  2. The workaround cached its input at startup.

Fix the FAMILY problem and both downstream issues collapse:

* `EcsManager._build_register_kwargs_from_base` now describes the bare
  base family (e.g. `isol8-dev-openclaw`) — that family contains ONLY
  CDK base revisions because per-user clones go to `<base>-user`.
  ECS returns the latest base revision deterministically.

* Per-user clones register into `f"{base['family']}-user"` so they
  don't pollute the base family. Existing per-user task defs on the
  old family stay valid (they're full ARNs); they age out as users
  re-provision.

* Drop the SSM param + `ECS_TASK_DEFINITION` env var + the
  `ecs.Secret.fromSsmParameter` wiring + the cached `self._task_def`.
  Less code, no startup cache to go stale, no incident class.

* Drop the lingering CFN `exportValue` from container-stack (added in
  #299 to keep the cross-stack import alive across the SSM transition;
  no consumers now).
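
The family split can be sketched as below. The fake client mimics only the `describe_task_definition(taskDefinition=...)` call and its `taskDefinition` response key; the function name `build_register_kwargs` and the sample image are assumptions, not the real `EcsManager` code.

```python
class FakeEcsClient:
    """Stand-in for an ECS client that serves canned base task definitions."""

    def __init__(self, task_defs: dict):
        self._task_defs = task_defs

    def describe_task_definition(self, taskDefinition: str) -> dict:
        # Describing by bare family name yields that family's latest revision.
        return {"taskDefinition": self._task_defs[taskDefinition]}


def build_register_kwargs(client, base_family: str) -> dict:
    """Clone the latest base revision into the separate '<base>-user' family."""
    base = client.describe_task_definition(taskDefinition=base_family)["taskDefinition"]
    return {
        "family": f"{base['family']}-user",
        # Copy so edits to the clone never mutate the described base.
        "containerDefinitions": [dict(cd) for cd in base["containerDefinitions"]],
    }


if __name__ == "__main__":
    client = FakeEcsClient({
        "isol8-dev-openclaw": {
            "family": "isol8-dev-openclaw",
            "containerDefinitions": [{"name": "openclaw", "image": "example:tag"}],
        }
    })
    print(build_register_kwargs(client, "isol8-dev-openclaw")["family"])
```

Because clones never land in the base family, the lookup needs no pinned ARN and no cached state.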

Tests updated:
  - `test_clones_task_def_with_access_point` asserts describe-by-family
    AND that the registered family is `<base>-user`.
  - `test_resize_reads_env_from_base_not_current` docstring updated
    to reflect the new mechanism.
  - All 1085 unit tests pass.
  - 11 pre-existing CDK Jest failures unrelated to this change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>