
chore(infra): unstick prod pipeline — .family cross-stack, revert prod.tag, drop nonce#326

Merged
prez2307 merged 1 commit into main from chore/revert-prod-tag-and-nonce
Apr 21, 2026

@prez2307
Contributor

Summary

PR #324's nonce theory was wrong: today's deploy failed again with the same "Cannot update export ... as it is in use" error. CloudFormation checks the consumer's live (already deployed) template, not pending changes queued in the same run, so giving service-stack a nonce diff doesn't help.

This is the PR-A half of a two-PR plan to roll the extended image out to everyone:

  1. Revert openclaw-version.json prod.tag → "bootstrap". Container-stack will have no task-def diff this deploy, so it can't hit the lock. Prod stays on alpine/openclaw:2026.4.5 (upstream), which is where every running task-def revision is now. Zero runtime regression.
  2. Swap service-stack.ts:588 from .taskDefinitionArn → .family. That's an inlined static string ("isol8-prod-openclaw"), not a cross-stack Fn::ImportValue. The coupling is gone: after deploy, isol8-prod-container:ExportsOutputRefOpenClawTaskDef... has no consumers.
  3. Drop DEPLOY_NONCE (leftover from PR #324, "chore(infra): unblock prod task-def export bump via DEPLOY_NONCE").
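
The sequencing in the three steps above can be illustrated with a toy model. This is only a sketch of the rule as described in this PR (the export lock is evaluated against the consumer's live template), not CloudFormation's actual implementation; the `Stack` shape, `canUpdateExport`, `deploy`, and the short export name are all hypothetical.

```typescript
// Toy model: names and shapes are illustrative, not CloudFormation's real algorithm.
type Template = { imports: string[] };

interface Stack {
  name: string;
  live: Template;           // what is currently deployed
  pending: Template | null; // what this deploy run wants to apply
}

/** An export may change only if no consumer's LIVE template imports it. */
function canUpdateExport(exportName: string, consumers: Stack[]): boolean {
  return !consumers.some((s) => s.live.imports.includes(exportName));
}

/** Applying a stack's pending template makes it the live one. */
function deploy(stack: Stack): void {
  if (stack.pending !== null) {
    stack.live = stack.pending;
    stack.pending = null;
  }
}

const EXPORT = "OpenClawTaskDefExport"; // hypothetical short name for the real export

const service: Stack = {
  name: "isol8-prod-service",
  live: { imports: [EXPORT] }, // still imports the task-def ARN
  pending: { imports: [] },    // this PR: ECS_TASK_DEFINITION becomes a plain string
};

// Container-stack deploys first, so a pending diff on service-stack is irrelevant.
const blockedBefore = canUpdateExport(EXPORT, [service]);

// Once service-stack has actually deployed, the export has no consumers.
deploy(service);
const allowedAfter = canUpdateExport(EXPORT, [service]);
```

In this model a nonce on service-stack only changes `pending`, which is why it could not unblock the container-stack update; the export change has to wait for a later deploy, which is what PR-B does.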

Trade-off

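.family reintroduces the "latest-in-family" lookup that PR #299 moved away from. It is safe under current code:

  • CLAWHUB_WORKDIR is now on every clone (added in PR #277 and inherited by all per-user clones since).
  • The per-user access point is always overridden by _build_register_kwargs_from_base, so cross-user leakage can't happen.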

A durable fix (putting the revision ARN behind an SSM parameter so it is not a CloudFormation import) is worth doing later; filed as a follow-up.

Follow-up PR

PR-B: re-bump prod.tag → "2026.4.5-bf9f699". With no consumer importing the export, container-stack updates the task-def freely. New provisions land on the extended image (clawhub baked in). Then POST /container/updates with owner_id:"all" rolls banners to every existing owner.

Test plan

  • isol8-prod-container deploys (no-op on task-def, expected).
  • isol8-prod-service deploys (template diff for ECS_TASK_DEFINITION env, and DEPLOY_NONCE removal).
  • After deploy, aws cloudformation list-imports --export-name isol8-prod-container:ExportsOutputRefOpenClawTaskDefDC1884BEC2B7400A returns empty.
  • Backend reads ECS_TASK_DEFINITION=isol8-prod-openclaw (family), describe_task_definition still works on the family name.
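
The last check relies on the ECS DescribeTaskDefinition API accepting a bare family name (latest ACTIVE revision), a family:revision pair, or a full ARN. A minimal sketch of telling these forms apart is below; `parseTaskDefRef` and `TaskDefRef` are hypothetical helpers, and the region and account ID in the example ARN are placeholders.

```typescript
// Hypothetical helper: classifies the value found in ECS_TASK_DEFINITION.
type TaskDefRef =
  | { kind: "family"; family: string }                    // resolves to the latest ACTIVE revision
  | { kind: "pinned"; family: string; revision: number }; // resolves to one exact revision

function parseTaskDefRef(value: string): TaskDefRef {
  // Full ARN form: arn:aws:ecs:<region>:<account>:task-definition/<family>:<revision>
  const name = value.startsWith("arn:") ? value.slice(value.indexOf("/") + 1) : value;
  const m = /^([A-Za-z0-9_-]+)(?::(\d+))?$/.exec(name);
  if (m === null) {
    throw new Error(`unrecognised task definition reference: ${value}`);
  }
  const family = m[1] ?? "";
  const revision = m[2];
  return revision === undefined
    ? { kind: "family", family }
    : { kind: "pinned", family, revision: Number(revision) };
}

// After this PR the backend sees the family form.
const afterPr = parseTaskDefRef("isol8-prod-openclaw");

// Before this PR it saw a revision ARN (placeholder region/account/revision).
const beforePr = parseTaskDefRef(
  "arn:aws:ecs:us-east-1:123456789012:task-definition/isol8-prod-openclaw:12",
);
```

Logging the parsed `kind` at backend startup would be one way to confirm which form is live after the deploy.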

🤖 Generated with Claude Code

…K_DEFINITION to .family, drop nonce

PR #323's prod deploy rolled back and every subsequent deploy hits the
same CloudFormation lock: isol8-prod-container can't update its
OpenClawTaskDef export because isol8-prod-service imports it. The lock
is checked against the consumer's live template, so no amount of pending
diff on service-stack this run helps (PR #324's DEPLOY_NONCE theory was
wrong — confirmed in today's failed deploy).

Quick-fix to unblock and set up the extended-image rollout as two PRs:

1. openclaw-version.json: revert prod.tag to "bootstrap" so container-
   stack has no task-def diff on this deploy. Prod's base image stays
   at alpine/openclaw:2026.4.5 (upstream) — where it is now, so zero
   runtime regression.
2. service-stack.ts: swap ECS_TASK_DEFINITION from
   props.container.openclawTaskDef.taskDefinitionArn to
   props.container.openclawTaskDef.family. That's an inlined static
   string ("isol8-prod-openclaw"), not an Fn::ImportValue — the cross-
   stack coupling disappears. On this deploy: container no-ops, service
   updates, and the OpenClawTaskDef export becomes unused.
3. Drop DEPLOY_NONCE from PR #324 (unnecessary once the cross-stack
   coupling is gone).

Follow-up PR re-bumps prod.tag to the extended image; with no consumer
left on the export, CFN updates the task-def revision freely.

Trade-off: .family reintroduces the "latest-in-family" lookup PR #299
moved away from. In practice safe under current code — CLAWHUB_WORKDIR
is now on every clone (added in PR #277, inherited by all per-user
clones since), and the per-user access point is always overridden by
_build_register_kwargs_from_base, so cross-user leakage can't happen.
A real fix (SSM-parameter indirection on the ARN) is worth doing later
to restore revision-pinning without the cross-stack lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b5a26fa7d1


// longer recur under current code. If we later want the ARN-revision
// pinning back, route it through an SSM parameter so the value isn't
// tied to a consumer-imported export.
ECS_TASK_DEFINITION: props.container.openclawTaskDef.family,

P1: Keep ECS task definition env pinned to revision ARN

Switching ECS_TASK_DEFINITION to openclawTaskDef.family makes the backend resolve describe_task_definition against family-latest instead of the CDK base revision, so any per-user revision that was most recently created (including ones from resize_user_container with custom new_cpu/new_memory/new_image) becomes the template for subsequently provisioned users. In apps/backend/core/containers/ecs_manager.py, _build_register_kwargs_from_base clones self._task_def and preserves base CPU/memory/image when overrides are absent, which means this change can silently propagate one tenant’s sizing/image choices to other tenants and reintroduce non-deterministic drift.
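
The concern in this review comment can be sketched with a small in-memory model. It only illustrates the reviewer's argument and is not the code in ecs_manager.py: `Family`, `cloneFromBase`, and all CPU/memory numbers are hypothetical, and only the "keep base values when overrides are absent" rule is taken from the comment.

```typescript
// Hypothetical model of a task-definition family with numbered revisions.
interface Revision { revision: number; cpu: number; memory: number; image: string }
type Spec = Omit<Revision, "revision">;
interface Overrides { cpu?: number; memory?: number; image?: string }

class Family {
  private revisions: Revision[] = [];

  register(spec: Spec): Revision {
    const rev: Revision = { ...spec, revision: this.revisions.length + 1 };
    this.revisions.push(rev);
    return rev;
  }

  /** "family" lookup: whatever was registered most recently. */
  latest(): Revision {
    const rev = this.revisions[this.revisions.length - 1];
    if (rev === undefined) throw new Error("family has no revisions");
    return rev;
  }

  /** "family:revision" lookup: one exact revision. */
  get(revision: number): Revision {
    const rev = this.revisions.find((r) => r.revision === revision);
    if (rev === undefined) throw new Error(`no revision ${revision}`);
    return rev;
  }
}

/** Clone the base and keep its CPU/memory/image wherever no override is given. */
function cloneFromBase(base: Revision, o: Overrides = {}): Spec {
  return {
    cpu: o.cpu ?? base.cpu,
    memory: o.memory ?? base.memory,
    image: o.image ?? base.image,
  };
}

const family = new Family();
const cdkBase = family.register({ cpu: 512, memory: 1024, image: "alpine/openclaw:2026.4.5" });

// User A is resized, which registers a new revision in the same family.
family.register(cloneFromBase(cdkBase, { cpu: 2048, memory: 4096 }));

// User B is provisioned with no overrides.
const userBFromLatest = cloneFromBase(family.latest());              // inherits A's sizing
const userBFromPinned = cloneFromBase(family.get(cdkBase.revision)); // inherits CDK base sizing
```

Under these assumptions the family lookup gives user B the 2048-CPU sizing chosen for user A, while the pinned lookup keeps the CDK base values.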


@prez2307 prez2307 merged commit 3914e78 into main Apr 21, 2026
prez2307 added a commit that referenced this pull request Apr 21, 2026
Previous attempt (#326) used props.container.openclawTaskDef.family but
the prop type is ecs.ITaskDefinition, which doesn't expose .family — only
the concrete TaskDefinition / FargateTaskDefinition classes do. Synth
failed with TS2339.

Inline the family string directly — it's defined in container-stack.ts
as a literal template (isol8-${env}-openclaw), so duplicating it here
matches the source of truth with no cross-stack plumbing. If we rename
the family in container-stack, grep catches this call site.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
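
A minimal sketch of the inlined-family approach this follow-up commit describes is shown below. The helper name `openclawFamily` and the `TaskDefLike` stand-in are hypothetical; the real call site may simply inline the template string.

```typescript
// Stand-in for an interface-typed prop that exposes the ARN but not the family
// (hypothetical shape; the commit reports that ecs.ITaskDefinition has no .family).
interface TaskDefLike {
  readonly taskDefinitionArn: string;
}

/** Mirrors the literal template from container-stack.ts: isol8-${env}-openclaw. */
function openclawFamily(env: string): string {
  return `isol8-${env}-openclaw`;
}

// Placeholder ARN-like value, only here to give the stand-in something to hold.
const taskDef: TaskDefLike = { taskDefinitionArn: "placeholder-arn" };

// taskDef.family would not compile here (TS2339), so the name is derived from env instead.
const environment = { ECS_TASK_DEFINITION: openclawFamily("prod") };
```

Because the value is a plain string computed at synth time, nothing in it references container-stack, so no cross-stack export is created.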