Skip to content

fix(containers): split CDK base + per-user task defs into separate families#410

Merged
prez2307 merged 1 commit into
mainfrom
fix/openclaw-task-def-family-split
Apr 28, 2026
Merged

fix(containers): split CDK base + per-user task defs into separate families#410
prez2307 merged 1 commit into
mainfrom
fix/openclaw-task-def-family-split

Conversation

@prez2307
Copy link
Copy Markdown
Contributor

Summary

Today's stale-image bug (per-user task def 1016 pointed at a non-existent placeholder image even after the deploy chain landed) was the second-order effect of a deeper architectural smell: CDK base and per-user task defs share an ECS family.

Two-bug onion

  1. Co-mingled families. Both CDK base revisions AND per-user clones register into isol8-{env}-openclaw. describe_task_definition('isol8-{env}-openclaw') returns whichever was registered last — usually a per-user clone. PR fix(containers): pin per-user task def base to ARN, never family-latest #299 worked around this by pinning to a specific ARN (with revision) via SSM.

  2. The SSM workaround caches at startup. Service-stack injected ECS_TASK_DEFINITION via ecs.Secret.fromSsmParameter, which ECS resolves once at task startup. The running backend started 23:13 EDT with SSM = rev 1011, and stayed pinned to that ARN through all subsequent CDK base bumps (rev 1012/1013/1014/1015). Today's user provisioning cloned from rev 1011 → produced rev 1016 with a broken placeholder image.

Fix the family problem and both layers collapse

  • EcsManager._build_register_kwargs_from_base now describe_task_definitions the bare base family. The base family is uncontaminated because per-user clones go to a separate <base>-user family. Latest base revision returns deterministically.
  • Per-user clones register into f"{base['family']}-user" (one-line change).
  • Delete the SSM param, ECS_TASK_DEFINITION env var, ecs.Secret.fromSsmParameter wiring, cached self._task_def, and the lingering CFN exportValue workaround. Net negative LoC.

Migration

Existing per-user task defs in isol8-dev-openclaw keep their full ARNs and continue to run. They age out naturally as users re-provision into isol8-dev-openclaw-user.

Test plan

  • pytest tests/unit — 1085 pass
  • npx tsc --noEmit clean
  • cdk synth dev/isol8-dev-service — confirmed ECS_TASK_DEFINITION and the SSM param are gone from synth
  • Test test_clones_task_def_with_access_point updated to assert describe-by-family AND register into <base>-user
  • Watch deploy + smoke-test ChatGPT OAuth on dev once new backend rolls

11 pre-existing CDK Jest failures unrelated to this change (verified by stashing and re-running on main).

🤖 Generated with Claude Code

…milies

The CDK base task def and per-user clones used to register into the same
ECS family (`isol8-{env}-openclaw`). #299 had to add an SSM-pinned ARN to
work around the resulting "describe-by-family returns a per-user clone"
problem. Then today (#404#409 deploy chain) we hit the OTHER edge of
that workaround: the ARN is injected into the backend via
`ecs.Secret.fromSsmParameter`, which resolves at TASK STARTUP only —
never refreshes. The running backend started with SSM=rev 1011 cached
and cloned per-user task defs from that stale revision through every
subsequent CDK deploy, producing a per-user task def 1016 that pulled
a placeholder image (`2026.4.25-bootstrap`) which doesn't exist in ECR.

Two underlying causes:
  1. Co-mingled families forced a workaround.
  2. The workaround cached its input at startup.

Fix the FAMILY problem and both downstream issues collapse:

* `EcsManager._build_register_kwargs_from_base` now describes the bare
  base family (e.g. `isol8-dev-openclaw`) — that family contains ONLY
  CDK base revisions because per-user clones go to `<base>-user`.
  ECS returns the latest base revision deterministically.

* Per-user clones register into `f"{base['family']}-user"` so they
  don't pollute the base family. Existing per-user task defs on the
  old family stay valid (they're full ARNs); they age out as users
  re-provision.

* Drop the SSM param + `ECS_TASK_DEFINITION` env var + the
  `ecs.Secret.fromSsmParameter` wiring + the cached `self._task_def`.
  Less code, no startup cache to go stale, no incident class.

* Drop the lingering CFN `exportValue` from container-stack (added in
  #299 to keep the cross-stack import alive across the SSM transition;
  no consumers now).

Tests updated:
  - `test_clones_task_def_with_access_point` asserts describe-by-family
    AND that the registered family is `<base>-user`.
  - `test_resize_reads_env_from_base_not_current` docstring updated
    to reflect the new mechanism.
  - All 1085 unit tests pass.
  - 11 pre-existing CDK Jest failures unrelated to this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@prez2307 prez2307 merged commit ae18f88 into main Apr 28, 2026
1 check passed
@prez2307 prez2307 deleted the fix/openclaw-task-def-family-split branch April 28, 2026 21:42
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6547f0c09e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +176 to +178
desc_resp = self._ecs.describe_task_definition(
taskDefinition=self._base_task_def_family,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guard base lookup from legacy mixed-family revisions

_build_register_kwargs_from_base now resolves the source definition via describe_task_definition on the bare family name, but existing environments can still have pre-#410 per-user revisions in isol8-<env>-openclaw from when clones were registered into the same family. In that migration window, ECS may return a per-user revision as “latest”, and this method copies that revision’s secrets/environment into new registrations, which can propagate stale or user-specific config (including BYOK secret ARNs) to other users until a new CDK base revision is published. The previous pinned-ARN path avoided this; this call needs a migration-safe guard (e.g., explicit base ARN/tag filter or bootstrap cutover) before relying on family-latest.

Useful? React with 👍 / 👎.

prez2307 added a commit that referenced this pull request May 23, 2026
Strip ephemeral review-cycle tags (Codex P1/P2 on PR #NNN, dated
incident narration like "2026-04-20 incident" / "2026-05-11 wedge",
deferred-to-future-PR canary TODOs, tracker codes like "W38 removed")
from CDK stacks and tests. Historical narration ("X removed in #410.
Backend now does Y") is rewritten as forward-looking statements of
the current design. Constraint rationales (cross-stack KMS posture,
NFS quirks, NLB-V1-for-WS-API) preserved verbatim.

11 files, +63/-100 lines. No code changes; pnpm test passes (one
pre-existing failure in marketplace-resources.test.ts is unrelated).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant