fix(containers): split CDK base + per-user task defs into separate families#410
Conversation
…milies
The CDK base task def and per-user clones used to register into the same
ECS family (`isol8-{env}-openclaw`). #299 had to add an SSM-pinned ARN to
work around the resulting "describe-by-family returns a per-user clone"
problem. Then today (#404→#409 deploy chain) we hit the OTHER edge of
that workaround: the ARN is injected into the backend via
`ecs.Secret.fromSsmParameter`, which resolves at TASK STARTUP only —
never refreshes. The running backend started with SSM=rev 1011 cached
and cloned per-user task defs from that stale revision through every
subsequent CDK deploy, producing a per-user task def 1016 that pulled
a placeholder image (`2026.4.25-bootstrap`) which doesn't exist in ECR.
Two underlying causes:
1. Co-mingled families forced a workaround.
2. The workaround cached its input at startup.
Fix the FAMILY problem and both downstream issues collapse:
* `EcsManager._build_register_kwargs_from_base` now describes the bare
base family (e.g. `isol8-dev-openclaw`) — that family contains ONLY
CDK base revisions because per-user clones go to `<base>-user`.
ECS returns the latest base revision deterministically.
* Per-user clones register into `f"{base['family']}-user"` so they
don't pollute the base family. Existing per-user task defs on the
old family stay valid (they're full ARNs); they age out as users
re-provision.
* Drop the SSM param + `ECS_TASK_DEFINITION` env var + the
`ecs.Secret.fromSsmParameter` wiring + the cached `self._task_def`.
Less code, no startup cache to go stale, no incident class.
* Drop the lingering CFN `exportValue` from container-stack (added in
#299 to keep the cross-stack import alive across the SSM transition;
no consumers now).
Tests updated:
- `test_clones_task_def_with_access_point` asserts describe-by-family
AND that the registered family is `<base>-user`.
- `test_resize_reads_env_from_base_not_current` docstring updated
to reflect the new mechanism.
- All 1085 unit tests pass.
- 11 pre-existing CDK Jest failures unrelated to this change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6547f0c09e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| desc_resp = self._ecs.describe_task_definition( | ||
| taskDefinition=self._base_task_def_family, | ||
| ) |
There was a problem hiding this comment.
Guard base lookup from legacy mixed-family revisions
_build_register_kwargs_from_base now resolves the source definition via describe_task_definition on the bare family name, but existing environments can still have pre-#410 per-user revisions in isol8-<env>-openclaw from when clones were registered into the same family. In that migration window, ECS may return a per-user revision as “latest”, and this method copies that revision’s secrets/environment into new registrations, which can propagate stale or user-specific config (including BYOK secret ARNs) to other users until a new CDK base revision is published. The previous pinned-ARN path avoided this; this call needs a migration-safe guard (e.g., explicit base ARN/tag filter or bootstrap cutover) before relying on family-latest.
Useful? React with 👍 / 👎.
Strip ephemeral review-cycle tags (Codex P1/P2 on PR #NNN, dated
incident narration like "2026-04-20 incident" / "2026-05-11 wedge",
deferred-to-future-PR canary TODOs, tracker codes like "W38 removed")
from CDK stacks and tests. Historical narration ("X removed in #410.
Backend now does Y") is rewritten as forward-looking statements of
the current design. Constraint rationales (cross-stack KMS posture,
NFS quirks, NLB-V1-for-WS-API) preserved verbatim.
11 files, +63/-100 lines. No code changes; pnpm test passes (one
pre-existing failure in marketplace-resources.test.ts is unrelated).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Today's stale-image bug (per-user task def 1016 pointed at a non-existent placeholder image even after the deploy chain landed) was the second-order effect of a deeper architectural smell: CDK base and per-user task defs share an ECS family.
Two-bug onion
Co-mingled families. Both CDK base revisions AND per-user clones register into
isol8-{env}-openclaw.describe_task_definition('isol8-{env}-openclaw')returns whichever was registered last — usually a per-user clone. PR fix(containers): pin per-user task def base to ARN, never family-latest #299 worked around this by pinning to a specific ARN (with revision) via SSM.The SSM workaround caches at startup. Service-stack injected
ECS_TASK_DEFINITIONviaecs.Secret.fromSsmParameter, which ECS resolves once at task startup. The running backend started 23:13 EDT with SSM = rev 1011, and stayed pinned to that ARN through all subsequent CDK base bumps (rev 1012/1013/1014/1015). Today's user provisioning cloned from rev 1011 → produced rev 1016 with a broken placeholder image.Fix the family problem and both layers collapse
EcsManager._build_register_kwargs_from_basenowdescribe_task_definitions the bare base family. The base family is uncontaminated because per-user clones go to a separate<base>-userfamily. Latest base revision returns deterministically.f"{base['family']}-user"(one-line change).ECS_TASK_DEFINITIONenv var,ecs.Secret.fromSsmParameterwiring, cachedself._task_def, and the lingering CFNexportValueworkaround. Net negative LoC.Migration
Existing per-user task defs in
isol8-dev-openclawkeep their full ARNs and continue to run. They age out naturally as users re-provision intoisol8-dev-openclaw-user.Test plan
pytest tests/unit— 1085 passnpx tsc --noEmitcleancdk synth dev/isol8-dev-service— confirmedECS_TASK_DEFINITIONand the SSM param are gone from synthtest_clones_task_def_with_access_pointupdated to assert describe-by-family AND register into<base>-user11 pre-existing CDK Jest failures unrelated to this change (verified by stashing and re-running on main).
🤖 Generated with Claude Code