Skip to content

fix(containers): pin per-user task def base to ARN, never family-latest#299

Merged
prez2307 merged 1 commit into
mainfrom
fix/task-def-base-pin
Apr 18, 2026
Merged

fix(containers): pin per-user task def base to ARN, never family-latest#299
prez2307 merged 1 commit into
mainfrom
fix/task-def-base-pin

Conversation

@prez2307
Copy link
Copy Markdown
Contributor

Summary

Stops per-user ECS task defs from drifting from the CDK-managed base. Two-part structural fix:

  1. CDK pins the base ARN. `container-stack.ts` exposes `openclawTaskDef` as a public readonly. `service-stack.ts` sets `ECS_TASK_DEFINITION` to the full ARN with revision (`props.container.openclawTaskDef.taskDefinitionArn`) instead of the family name. CDK deploys auto-roll the backend with the new ARN.

  2. Backend cloner always reads from base. Both `_register_task_definition` (initial provision) AND `resize_user_container` (Update Now / tier change) now use a shared helper `_build_register_kwargs_from_base` that reads container-definitions / env / mounts from `self._task_def` (the pinned ARN). Per-user state — EFS access point, image override, cpu/memory — is layered on top.

Why

Tonight's CLAWHUB_WORKDIR incident: per-user task defs from `:953` onward were missing the env var even though CDK base `:955` had it. Two paths leaked:

  • `_register_task_definition` read `describe_task_definition("isol8-dev-openclaw")` (family name) → ECS resolves to family-latest → which is a per-user clone, not CDK base. Adding env vars to CDK silently failed to propagate once a single per-user clone existed.
  • `resize_user_container` read from the user's CURRENT task def → any drift stayed forever. Update Now couldn't self-heal.

After this fix, all clones read from a fixed ARN that only CDK can change, and resize always re-syncs to the current base.

Test plan

  • Backend: 747 passed
  • CDK: `tsc --noEmit` clean
  • New tests assert CLAWHUB_WORKDIR survives both clone paths
  • After merge + deploy: re-provision a user (`DELETE /debug/provision` + `POST /debug/provision`) and verify the new task def has CLAWHUB_WORKDIR
  • After merge + deploy: click Update Now on a drifted user and verify the resulting task def has CLAWHUB_WORKDIR (self-heal)

Known follow-ups (out of scope here)

  • Existing drifted per-user task defs (`:953`, `:956`, `:957` on dev) keep their bug until re-provisioned or resized.
  • Surface clawhub install paths in container logs so silent misroutes are detectable. Worth filing.

🤖 Generated with Claude Code

Per-user task defs were drifting from the CDK base because:

1. CDK and per-user clones registered into the SAME ECS family. ECS resolves
   "family without revision" to family-latest, which can be a per-user clone
   instead of the CDK base. Adding env vars to the CDK base no longer
   propagates to new per-user clones once any per-user revision exists.

2. resize_user_container read from the user's CURRENT per-user task def, so
   any drift on that user's task def stayed forever — Update Now couldn't
   self-heal.

Result on dev (incident 2026-04-17): every per-user task def from :953
onward was missing CLAWHUB_WORKDIR, even though the CDK base had it. Every
clawhub install landed at /home/node/.openclaw/workspaces/skills/<slug> —
a path no agent scanner reads.

Fix:

- container-stack.ts exposes openclawTaskDef as a public readonly so other
  stacks can reference it.
- service-stack.ts pins the backend's ECS_TASK_DEFINITION env var to the
  full task def ARN (with revision) instead of the family name. New CDK
  deploys auto-roll the backend with the new ARN.
- ecs_manager.py extracts a single _build_register_kwargs_from_base helper
  that ALWAYS reads from self._task_def. Both _register_task_definition
  (initial provision) AND resize_user_container use it. resize_user_container
  now reads access_point_id from DDB instead of from the user's prior
  task def.

Tests:
- New TestRegisterTaskDefinition::test_register_preserves_env_vars_from_base
  proves CLAWHUB_WORKDIR survives the cloner.
- New TestResizeUserContainer::test_resize_reads_env_from_base_not_current
  proves resize self-heals when a user has a drifted task def.
- _make_container_dict default now includes access_point_id (resize
  requires it). Tests that verify "no per-user resources" path explicitly
  pass access_point_id=None.

Backend tests: 747 passed.
CDK tsc --noEmit: clean.

Note: this fix affects future provisions/resizes. Existing users with
drifted per-user task defs need to either re-provision (delete + POST
/debug/provision) OR click Update Now (which now self-heals via the new
resize path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@prez2307 prez2307 merged commit 808d580 into main Apr 18, 2026
1 check passed
@prez2307 prez2307 deleted the fix/task-def-base-pin branch April 18, 2026 03:14
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d44211cd37

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +174 to +177
for cd in base["containerDefinitions"]:
cd_copy = dict(cd)
if new_image:
cd_copy["image"] = new_image
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve current image on CPU/memory-only resize

resize_user_container still documents new_image=None as “keep current”, but this helper now always clones containerDefinitions from the pinned base task definition and only rewrites image when new_image is provided. That means a user who previously received an image_update (via apply_update with new_image) will be rolled back to the base image on a later tier resize (new_cpu/new_memory only), which can silently undo hotfix or rollout images.

Useful? React with 👍 / 👎.

prez2307 added a commit that referenced this pull request Apr 21, 2026
…K_DEFINITION to .family, drop nonce (#326)

PR #323's prod deploy rolled back and every subsequent deploy hits the
same CloudFormation lock: isol8-prod-container can't update its
OpenClawTaskDef export because isol8-prod-service imports it. The lock
is checked against the consumer's live template, so no amount of pending
diff on service-stack this run helps (PR #324's DEPLOY_NONCE theory was
wrong — confirmed in today's failed deploy).

Quick-fix to unblock and set up the extended-image rollout as two PRs:

1. openclaw-version.json: revert prod.tag to "bootstrap" so container-
   stack has no task-def diff on this deploy. Prod's base image stays
   at alpine/openclaw:2026.4.5 (upstream) — where it is now, so zero
   runtime regression.
2. service-stack.ts: swap ECS_TASK_DEFINITION from
   props.container.openclawTaskDef.taskDefinitionArn to
   props.container.openclawTaskDef.family. That's an inlined static
   string ("isol8-prod-openclaw"), not an Fn::ImportValue — the cross-
   stack coupling disappears. On this deploy: container no-ops, service
   updates, and the OpenClawTaskDef export becomes unused.
3. Drop DEPLOY_NONCE from PR #324 (unnecessary once the cross-stack
   coupling is gone).

Follow-up PR re-bumps prod.tag to the extended image; with no consumer
left on the export, CFN updates the task-def revision freely.

Trade-off: .family reintroduces the "latest-in-family" lookup PR #299
moved away from. In practice safe under current code — CLAWHUB_WORKDIR
is now on every clone (added in PR #277, inherited by all per-user
clones since), and the per-user access point is always overridden by
_build_register_kwargs_from_base, so cross-user leakage can't happen.
A real fix (SSM-parameter indirection on the ARN) is worth doing later
to restore revision-pinning without the cross-stack lock.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prez2307 added a commit that referenced this pull request Apr 28, 2026
…milies (#410)

The CDK base task def and per-user clones used to register into the same
ECS family (`isol8-{env}-openclaw`). #299 had to add an SSM-pinned ARN to
work around the resulting "describe-by-family returns a per-user clone"
problem. Then today (#404#409 deploy chain) we hit the OTHER edge of
that workaround: the ARN is injected into the backend via
`ecs.Secret.fromSsmParameter`, which resolves at TASK STARTUP only —
never refreshes. The running backend started with SSM=rev 1011 cached
and cloned per-user task defs from that stale revision through every
subsequent CDK deploy, producing a per-user task def 1016 that pulled
a placeholder image (`2026.4.25-bootstrap`) which doesn't exist in ECR.

Two underlying causes:
  1. Co-mingled families forced a workaround.
  2. The workaround cached its input at startup.

Fix the FAMILY problem and both downstream issues collapse:

* `EcsManager._build_register_kwargs_from_base` now describes the bare
  base family (e.g. `isol8-dev-openclaw`) — that family contains ONLY
  CDK base revisions because per-user clones go to `<base>-user`.
  ECS returns the latest base revision deterministically.

* Per-user clones register into `f"{base['family']}-user"` so they
  don't pollute the base family. Existing per-user task defs on the
  old family stay valid (they're full ARNs); they age out as users
  re-provision.

* Drop the SSM param + `ECS_TASK_DEFINITION` env var + the
  `ecs.Secret.fromSsmParameter` wiring + the cached `self._task_def`.
  Less code, no startup cache to go stale, no incident class.

* Drop the lingering CFN `exportValue` from container-stack (added in
  #299 to keep the cross-stack import alive across the SSM transition;
  no consumers now).

Tests updated:
  - `test_clones_task_def_with_access_point` asserts describe-by-family
    AND that the registered family is `<base>-user`.
  - `test_resize_reads_env_from_base_not_current` docstring updated
    to reflect the new mechanism.
  - All 1085 unit tests pass.
  - 11 pre-existing CDK Jest failures unrelated to this change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant