chore(infra): unstick prod pipeline: .family cross-stack, revert prod.tag, drop nonce #326
…K_DEFINITION to .family, drop nonce

PR #323's prod deploy rolled back and every subsequent deploy hits the same CloudFormation lock: isol8-prod-container can't update its OpenClawTaskDef export because isol8-prod-service imports it. The lock is checked against the consumer's live template, so no amount of pending diff on service-stack this run helps (PR #324's DEPLOY_NONCE theory was wrong, as confirmed in today's failed deploy).

Quick-fix to unblock and set up the extended-image rollout as two PRs:

1. openclaw-version.json: revert prod.tag to "bootstrap" so container-stack has no task-def diff on this deploy. Prod's base image stays at alpine/openclaw:2026.4.5 (upstream), where it is now, so zero runtime regression.
2. service-stack.ts: swap ECS_TASK_DEFINITION from props.container.openclawTaskDef.taskDefinitionArn to props.container.openclawTaskDef.family. That's an inlined static string ("isol8-prod-openclaw"), not an Fn::ImportValue, so the cross-stack coupling disappears. On this deploy: container no-ops, service updates, and the OpenClawTaskDef export becomes unused.
3. Drop DEPLOY_NONCE from PR #324 (unnecessary once the cross-stack coupling is gone).

Follow-up PR re-bumps prod.tag to the extended image; with no consumer left on the export, CFN updates the task-def revision freely.

Trade-off: .family reintroduces the "latest-in-family" lookup PR #299 moved away from. In practice safe under current code: CLAWHUB_WORKDIR is now on every clone (added in PR #277, inherited by all per-user clones since), and the per-user access point is always overridden by _build_register_kwargs_from_base, so cross-user leakage can't happen. A real fix (SSM-parameter indirection on the ARN) is worth doing later to restore revision-pinning without the cross-stack lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
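The lock described in this commit message can be illustrated with a small toy model. This is only a sketch of the rule as stated above (an export cannot change while a consumer's deployed template imports it). It is not CloudFormation's implementation: the `Stack` class, the helper functions, and the `rev-1`/`rev-2` values are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


class ExportInUseError(Exception):
    """Raised when an export that a deployed stack still imports would change."""


@dataclass
class Stack:
    name: str
    # export name -> current value (producer side)
    exports: dict = field(default_factory=dict)
    # export names referenced by the template that is deployed right now
    live_imports: set = field(default_factory=set)


def update_export(producer, consumers, export_name, new_value):
    """Change an export value unless a consumer's live template imports it."""
    if producer.exports.get(export_name) == new_value:
        return "no-op"  # no diff, so the lock is never consulted
    users = [c.name for c in consumers if export_name in c.live_imports]
    if users:
        raise ExportInUseError(
            f"Cannot update export {export_name} as it is in use by {', '.join(users)}"
        )
    producer.exports[export_name] = new_value
    return "updated"


def deploy_consumer(consumer, new_imports):
    """Deploying the consumer replaces the set of imports in its live template."""
    consumer.live_imports = set(new_imports)


container = Stack("isol8-prod-container", exports={"OpenClawTaskDef": "rev-1"})
service = Stack("isol8-prod-service", live_imports={"OpenClawTaskDef"})

# This PR: container has no diff, then service deploys without the import.
print(update_export(container, [service], "OpenClawTaskDef", "rev-1"))
deploy_consumer(service, set())

# Follow-up PR: the export has no consumers left, so the bump goes through.
print(update_export(container, [service], "OpenClawTaskDef", "rev-2"))
```

In this model a pending change to the service stack has no effect until `deploy_consumer` has actually run, which is why the producer must no-op on the deploy that removes the import.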
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b5a26fa7d1
// longer recur under current code. If we later want the ARN-revision
// pinning back, route it through an SSM parameter so the value isn't
// tied to a consumer-imported export.
ECS_TASK_DEFINITION: props.container.openclawTaskDef.family,
Keep ECS task definition env pinned to revision ARN
Switching ECS_TASK_DEFINITION to openclawTaskDef.family makes the backend resolve describe_task_definition against family-latest instead of the CDK base revision, so any per-user revision that was most recently created (including ones from resize_user_container with custom new_cpu/new_memory/new_image) becomes the template for subsequently provisioned users. In apps/backend/core/containers/ecs_manager.py, _build_register_kwargs_from_base clones self._task_def and preserves base CPU/memory/image when overrides are absent, which means this change can silently propagate one tenant’s sizing/image choices to other tenants and reintroduce non-deterministic drift.
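The drift this comment describes can be modelled without AWS. The sketch below assumes only what the comment states: a bare family name resolves to the newest revision, and a clone keeps the base's CPU, memory, and image when no override is given. `Registry`, `clone_for_user`, the shortened ARN shape, and the CPU/memory numbers are hypothetical; the real logic in ecs_manager.py is not shown here and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    family: str
    number: int
    cpu: str
    memory: str
    image: str

    @property
    def arn(self):
        # Shortened, hypothetical ARN shape for this sketch only.
        return f"task-definition/{self.family}:{self.number}"


class Registry:
    """Toy stand-in for the task-definition store."""

    def __init__(self):
        self._revisions = []

    def register(self, family, cpu, memory, image):
        number = 1 + sum(1 for r in self._revisions if r.family == family)
        rev = Revision(family, number, cpu, memory, image)
        self._revisions.append(rev)
        return rev

    def describe(self, ref):
        """Resolve an exact revision ARN, or the newest revision of a family."""
        for rev in self._revisions:
            if rev.arn == ref:
                return rev
        candidates = [r for r in self._revisions if r.family == ref]
        if not candidates:
            raise KeyError(ref)
        return max(candidates, key=lambda r: r.number)


def clone_for_user(registry, base_ref, cpu=None, memory=None, image=None):
    """Register a per-user revision, inheriting any field without an override."""
    base = registry.describe(base_ref)
    return registry.register(
        base.family, cpu or base.cpu, memory or base.memory, image or base.image
    )


reg = Registry()
base = reg.register("isol8-prod-openclaw", "512", "1024", "alpine/openclaw:2026.4.5")

# One tenant resizes, which creates a new revision in the same family.
clone_for_user(reg, base.arn, cpu="2048", memory="4096")

# The next provision inherits that sizing when the base is the family name...
floating = clone_for_user(reg, "isol8-prod-openclaw")
# ...but not when the base is pinned to the original revision ARN.
pinned = clone_for_user(reg, base.arn)
print(floating.cpu, pinned.cpu)
```

Whether this matters in practice depends on which fields the backend always overrides per user, which is the point the PR description argues under "Trade-off".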
Previous attempt (#326) used props.container.openclawTaskDef.family but the prop type is ecs.ITaskDefinition, which doesn't expose .family; only the concrete TaskDefinition / FargateTaskDefinition classes do. Synth failed with TS2339.

Inline the family string directly. It's defined in container-stack.ts as a literal template (isol8-${env}-openclaw), so duplicating it here matches the source of truth with no cross-stack plumbing. If we rename the family in container-stack, grep catches this call site.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
PR #324's nonce theory was wrong: today's deploy failed again with the same "Cannot update export ... as it is in use" error. CFN checks the consumer's live template, not pending changes queued in the same run, so giving service-stack a nonce diff doesn't help.

This is the PR-A half of a two-PR plan to roll the extended image out to everyone:

1. openclaw-version.json: prod.tag → "bootstrap". Container-stack will have no task-def diff this deploy, so it can't hit the lock. Prod stays on alpine/openclaw:2026.4.5 (upstream), where every running task-def revision is now. Zero runtime regression.
2. service-stack.ts:588: from .taskDefinitionArn → .family. That's an inlined static string ("isol8-prod-openclaw"), not a cross-stack Fn::ImportValue. The coupling is gone: after deploy, isol8-prod-container:ExportsOutputRefOpenClawTaskDef... has no consumers.
3. Drop DEPLOY_NONCE (PR #324 leftover).

Trade-off
.family reintroduces the "latest-in-family" lookup PR #299 moved away from. Currently safe:
- CLAWHUB_WORKDIR is on every clone now (added in PR #277, inherited by every subsequent per-user clone).
- The per-user access point is always overridden by _build_register_kwargs_from_base, so no cross-user leakage even if the backend clones from a per-user revision.

Durable fix (put the revision ARN behind an SSM parameter so it's not a CFN-import) is worth doing later; filed as follow-up.

Follow-up PR
PR-B: re-bump prod.tag → "2026.4.5-bf9f699". With no consumer importing the export, container-stack updates the task-def freely. New provisions land on the extended image (clawhub baked in). Then POST /container/updates with owner_id:"all" rolls banners to every existing owner.

Test plan
- isol8-prod-container deploys (no-op on task-def, expected).
- isol8-prod-service deploys (template diff for ECS_TASK_DEFINITION env, and DEPLOY_NONCE removal).
- aws cloudformation list-imports --export-name isol8-prod-container:ExportsOutputRefOpenClawTaskDefDC1884BEC2B7400A returns empty.
- ECS_TASK_DEFINITION=isol8-prod-openclaw (family), describe_task_definition still works on the family name.

🤖 Generated with Claude Code