fix(containers): pin per-user task def base to ARN, never family-latest#299
Conversation
Per-user task defs were drifting from the CDK base because: 1. CDK and per-user clones registered into the SAME ECS family. ECS resolves "family without revision" to family-latest, which can be a per-user clone instead of the CDK base. Adding env vars to the CDK base no longer propagates to new per-user clones once any per-user revision exists. 2. resize_user_container read from the user's CURRENT per-user task def, so any drift on that user's task def stayed forever — Update Now couldn't self-heal. Result on dev (incident 2026-04-17): every per-user task def from :953 onward was missing CLAWHUB_WORKDIR, even though the CDK base had it. Every clawhub install landed at /home/node/.openclaw/workspaces/skills/<slug> — a path no agent scanner reads. Fix: - container-stack.ts exposes openclawTaskDef as a public readonly so other stacks can reference it. - service-stack.ts pins the backend's ECS_TASK_DEFINITION env var to the full task def ARN (with revision) instead of the family name. New CDK deploys auto-roll the backend with the new ARN. - ecs_manager.py extracts a single _build_register_kwargs_from_base helper that ALWAYS reads from self._task_def. Both _register_task_definition (initial provision) AND resize_user_container use it. resize_user_container now reads access_point_id from DDB instead of from the user's prior task def. Tests: - New TestRegisterTaskDefinition::test_register_preserves_env_vars_from_base proves CLAWHUB_WORKDIR survives the cloner. - New TestResizeUserContainer::test_resize_reads_env_from_base_not_current proves resize self-heals when a user has a drifted task def. - _make_container_dict default now includes access_point_id (resize requires it). Tests that verify "no per-user resources" path explicitly pass access_point_id=None. Backend tests: 747 passed. CDK tsc --noEmit: clean. Note: this fix affects future provisions/resizes. Existing users with drifted per-user task defs need to either re-provision (delete + POST /debug/provision) OR click Update Now (which now self-heals via the new resize path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d44211cd37
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for cd in base["containerDefinitions"]: | ||
| cd_copy = dict(cd) | ||
| if new_image: | ||
| cd_copy["image"] = new_image |
There was a problem hiding this comment.
Preserve current image on CPU/memory-only resize
resize_user_container still documents new_image=None as “keep current”, but this helper now always clones containerDefinitions from the pinned base task definition and only rewrites image when new_image is provided. That means a user who previously received an image_update (via apply_update with new_image) will be rolled back to the base image on a later tier resize (new_cpu/new_memory only), which can silently undo hotfix or rollout images.
Useful? React with 👍 / 👎.
…K_DEFINITION to .family, drop nonce (#326) PR #323's prod deploy rolled back and every subsequent deploy hits the same CloudFormation lock: isol8-prod-container can't update its OpenClawTaskDef export because isol8-prod-service imports it. The lock is checked against the consumer's live template, so no amount of pending diff on service-stack this run helps (PR #324's DEPLOY_NONCE theory was wrong — confirmed in today's failed deploy). Quick-fix to unblock and set up the extended-image rollout as two PRs: 1. openclaw-version.json: revert prod.tag to "bootstrap" so container- stack has no task-def diff on this deploy. Prod's base image stays at alpine/openclaw:2026.4.5 (upstream) — where it is now, so zero runtime regression. 2. service-stack.ts: swap ECS_TASK_DEFINITION from props.container.openclawTaskDef.taskDefinitionArn to props.container.openclawTaskDef.family. That's an inlined static string ("isol8-prod-openclaw"), not an Fn::ImportValue — the cross- stack coupling disappears. On this deploy: container no-ops, service updates, and the OpenClawTaskDef export becomes unused. 3. Drop DEPLOY_NONCE from PR #324 (unnecessary once the cross-stack coupling is gone). Follow-up PR re-bumps prod.tag to the extended image; with no consumer left on the export, CFN updates the task-def revision freely. Trade-off: .family reintroduces the "latest-in-family" lookup PR #299 moved away from. In practice safe under current code — CLAWHUB_WORKDIR is now on every clone (added in PR #277, inherited by all per-user clones since), and the per-user access point is always overridden by _build_register_kwargs_from_base, so cross-user leakage can't happen. A real fix (SSM-parameter indirection on the ARN) is worth doing later to restore revision-pinning without the cross-stack lock. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…milies (#410) The CDK base task def and per-user clones used to register into the same ECS family (`isol8-{env}-openclaw`). #299 had to add an SSM-pinned ARN to work around the resulting "describe-by-family returns a per-user clone" problem. Then today (#404→#409 deploy chain) we hit the OTHER edge of that workaround: the ARN is injected into the backend via `ecs.Secret.fromSsmParameter`, which resolves at TASK STARTUP only — never refreshes. The running backend started with SSM=rev 1011 cached and cloned per-user task defs from that stale revision through every subsequent CDK deploy, producing a per-user task def 1016 that pulled a placeholder image (`2026.4.25-bootstrap`) which doesn't exist in ECR. Two underlying causes: 1. Co-mingled families forced a workaround. 2. The workaround cached its input at startup. Fix the FAMILY problem and both downstream issues collapse: * `EcsManager._build_register_kwargs_from_base` now describes the bare base family (e.g. `isol8-dev-openclaw`) — that family contains ONLY CDK base revisions because per-user clones go to `<base>-user`. ECS returns the latest base revision deterministically. * Per-user clones register into `f"{base['family']}-user"` so they don't pollute the base family. Existing per-user task defs on the old family stay valid (they're full ARNs); they age out as users re-provision. * Drop the SSM param + `ECS_TASK_DEFINITION` env var + the `ecs.Secret.fromSsmParameter` wiring + the cached `self._task_def`. Less code, no startup cache to go stale, no incident class. * Drop the lingering CFN `exportValue` from container-stack (added in #299 to keep the cross-stack import alive across the SSM transition; no consumers now). Tests updated: - `test_clones_task_def_with_access_point` asserts describe-by-family AND that the registered family is `<base>-user`. - `test_resize_reads_env_from_base_not_current` docstring updated to reflect the new mechanism. - All 1085 unit tests pass. - 11 pre-existing CDK Jest failures unrelated to this change. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Stops per-user ECS task defs from drifting from the CDK-managed base. Two-part structural fix:
CDK pins the base ARN. `container-stack.ts` exposes `openclawTaskDef` as a public readonly. `service-stack.ts` sets `ECS_TASK_DEFINITION` to the full ARN with revision (`props.container.openclawTaskDef.taskDefinitionArn`) instead of the family name. CDK deploys auto-roll the backend with the new ARN.
Backend cloner always reads from base. Both `_register_task_definition` (initial provision) AND `resize_user_container` (Update Now / tier change) now use a shared helper `_build_register_kwargs_from_base` that reads container-definitions / env / mounts from `self._task_def` (the pinned ARN). Per-user state — EFS access point, image override, cpu/memory — is layered on top.
Why
Tonight's CLAWHUB_WORKDIR incident: per-user task defs from `:953` onward were missing the env var even though CDK base `:955` had it. Two paths leaked:
After this fix, all clones read from a fixed ARN that only CDK can change, and resize always re-syncs to the current base.
Test plan
Known follow-ups (out of scope here)
🤖 Generated with Claude Code