feat: Cloudflare staging migration - TF bootstrap + wrangler scaffold + per-env apply #32
Merged
First Terraform module for managing Cloudflare resources. Mirrors the pattern of infra/newrelic/terraform/ (CF provider v5, env-driven creds, explicit `terraform apply` per CLAUDE.md rule 15).

Resources:
- Two cloudflare_account_token resources (deploy + admin/tunnel) with scoped permission_groups, account-bound to CF for Startups account 613a9e74136364c781a8e258326019f9.
- cloudflare_r2_bucket "instant-shared" (or "-staging") with 24h-TTL lifecycle on the anon/ prefix (matches platform anon-resource TTL).
- DNS records for apex/www/api/staging at TTL=60 per the cutover ramp ritual (CF-migration DECISIONS.md D-3).
- cloudflare_pages_project for instanode-web (Phase 2). Dashboard-on-Pages is explicitly NOT here (D-5 kill).
- Sensitive token outputs for the `make install-secrets` helper.

State backend: R2 (S3-compatible), bucket "instanode-tf-state". Operator must create the bucket + HMAC creds out-of-band before `terraform init` (see README §Bootstrap). Per-env workspaces (staging/production) selected via terraform workspace. production.auto.tfvars and staging.auto.tfvars commit-safe (no secrets).

D-N decisions implemented:
- D-1 (scope), D-2 (staging on full CF stack), D-3 (per-svc DNS-weighted cutover), D-5 (no dashboard-on-Pages), D-7 (NS delegation confirmed CF), D-8 (canonical R2 env-var names)

Verified locally with `terraform fmt -check -recursive` + `terraform validate` against provider v5.19.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three workflows for the CF-migration TF module:
1. terraform.yml: plan-on-PR for BOTH staging and production. Matrix over the two envs, posts plan diff as a PR comment, never applies.
2. terraform-apply-staging.yml: workflow_dispatch only. Confirm phrase `APPLY-STAGING`. GH Environment binding "staging".
3. terraform-apply-production.yml: workflow_dispatch only. STRICTER: confirm phrase `APPLY-PRODUCTION` + numeric `staging_run_id` validated as digits-only + `gh run view` confirms the matching staging apply actually succeeded. GH Environment binding "production" gates on required reviewers.

There is NO auto-promotion path from staging to production. Each prod apply is a separate human-triggered run that must clear all three gates.

Security: every GHA expression consumed in run: blocks is wrapped through env: to prevent script injection. No client_payload usage. No `ref:` parameter on any checkout.

Mirrors infra's existing validate.yml/apply.yml split for k8s manifests (rule 15: state-changing operations require deliberate human trigger).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
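
The three production gates described above could be sketched as small shell helpers for a `run:` step. This is an illustrative sketch, not the workflow's actual code: the function names are hypothetical, and the inputs are assumed to arrive as plain arguments populated from `env:` rather than from inline GHA expressions.

```shell
# Hypothetical gate helpers; names are illustrative, not taken from the workflow.

# Gate 1: the typed confirm phrase must match the expected phrase exactly.
check_confirm_phrase() {
  [ "$1" = "$2" ]
}

# Gate 2: staging_run_id must be non-empty and digits only.
check_run_id() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Gate 3: the referenced staging run must have concluded with "success".
# Needs an authenticated gh CLI; call it only after check_run_id passes.
check_staging_succeeded() {
  [ "$(gh run view "$1" --json conclusion --jq '.conclusion')" = "success" ]
}
```

Running the digits-only check before the `gh run view` call means the run ID is already known to be a plain number by the time it reaches another command.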
Per the 2026-05-30 user direction: staging is CF-only, ephemeral state acceptable. The cloudflare/cloudflare TF provider does NOT yet expose cloudflare_container (verified: provider v5.19.1), so Containers are managed by wrangler while TF handles everything else.

8 services scaffolded under wrangler/:
- api / worker / provisioner: wraps GHCR images (built by their repos' CI with the new :staging tag) in a CF Container via a tiny Worker shell (src/worker.ts) that forwards requests to the Container DO.
- pg-platform: custom image (Dockerfile here) that wraps pgvector/pgvector:pg16 and bakes all 63 platform migrations from api/internal/db/migrations/ into /docker-entrypoint-initdb.d/. Cold starts re-apply migrations because CF Containers wipe disk on sleep.
- pg-customers / mongodb / redis-provision / nats: per-tenant Containers (idFromName(tenant)) backing /db/new, /nosql/new, /cache/new, /queue/new in staging.
- mongodb / redis-provision / nats: custom images that wrap the upstream image with healthcheck + staging-suitable config (Redis with auth + LRU eviction + RDB/AOF disabled, NATS with core-only no-JetStream legacy_open auth).

The ephemeral-state acceptance criterion is documented in wrangler/README.md. E2E tests must seed their own fixtures; no "deploy then verify 2h later" tests survive Container sleep.

Two new workflows:
- wrangler-build-staging-images.yml: builds the 4 custom images (pg-platform with cross-repo api checkout for migrations; mongodb/redis-provision/nats single-repo). Triggers: workflow_dispatch, push to wrangler/<svc>/**, daily cron 09:00 UTC, repository_dispatch "migrations-changed" event from the api repo.
- wrangler-deploy-staging.yml: workflow_dispatch only. Service input validated against a whitelist before any shell use. Confirm phrase `DEPLOY-STAGING`.

Production does NOT use this dir; production target is unsettled and will get a separate workflow when chosen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
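
A whitelist check of the kind described for the deploy workflow's service input might look like the sketch below. The function name is hypothetical. The accepted values are assumed to be the eight service directories listed above plus `all`; the real workflow's list may differ.

```shell
# Hypothetical whitelist check for the workflow's service input.
# Accepted values are an assumption: the eight wrangler/ services plus "all".
validate_service() {
  case "$1" in
    all|api|worker|provisioner|pg-platform|pg-customers|mongodb|redis-provision|nats)
      return 0 ;;
    *)
      # printf with a fixed format string; the input is never evaluated.
      printf 'unknown service: %s\n' "$1" >&2
      return 1 ;;
  esac
}
```

Because `case` only compares the value against fixed patterns, the untrusted input is never executed or expanded into a path before it has been accepted.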
This was referenced May 30, 2026
Before: terraform init failed with a cryptic AWS-IAM/IMDS stack trace when CLOUDFLARE_API_TOKEN / TF_STATE_R2_* / CF_ACCOUNT_ID secrets were empty (the bootstrap chicken-and-egg documented in terraform/cloudflare/README.md §Bootstrap).

After: a pre-flight step on the plan + apply jobs detects the empty secrets and fails with a one-line message naming the missing variables and linking to the README. The PR still goes red (correct: operator action IS required), but the cause is now obvious from the CI log instead of buried inside terraform's S3 backend init.

Applied identically to terraform.yml (plan), terraform-apply-staging.yml, terraform-apply-production.yml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to f8afb4c: the previous commit added the guard only to terraform.yml (the plan-on-PR job). The two apply workflows (staging and production) had the same chicken-and-egg failure mode and need the same guard, so that the operator gets a clear message instead of a cryptic AWS IAM stack trace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
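
A pre-flight guard like the one these two commits describe could be sketched as follows. The function names and the message wording are hypothetical. The sketch assumes the secrets are exported into the step's environment via `env:`. The `TF_STATE_R2_*` names are elided in the commit message, so only the two fully named variables appear in the usage example.

```shell
# Hypothetical pre-flight helpers; names and message text are illustrative.

# Prints the name of every listed variable that is unset or empty, one per line.
missing_vars() {
  for name in "$@"; do
    if [ -z "$(printenv "$name" 2>/dev/null)" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Fails with a single-line message when any listed variable is missing.
preflight() {
  missing="$(missing_vars "$@")"
  if [ -n "$missing" ]; then
    printf 'Missing required secrets: %s (see terraform/cloudflare/README.md, Bootstrap)\n' \
      "$(printf '%s' "$missing" | tr '\n' ' ')" >&2
    return 1
  fi
}
```

A step might call `preflight CLOUDFLARE_API_TOKEN CF_ACCOUNT_ID` before `terraform init`, so that an empty secret fails the job before the S3 backend is ever initialised.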
Adds the per-env DNS records and CF cache rules needed for the
staging.instanode.dev environment that the wrangler scaffold deploys
into. Both files count-gate on var.environment so the production
workspace plan shows no changes from this commit.
staging.tf — new file, 5 DNS records + 4 Workers routes + 1 Pages
custom-domain:
- *.staging.instanode.dev (wildcard CNAME, proxied) — per-tenant
service catchall; routed below
- *.deployment.staging.instanode.dev — mirror of prod *.deployment.*
for /deploy/new staging apps
- deployment.staging.instanode.dev — anchor for the wildcard above
- webhook.staging.instanode.dev — /webhook/new receiver subdomain
(separate host so customers can
filter outbound by destination)
- dashboard.staging.instanode.dev — QA-only dashboard (D-5 still
keeps prod dashboard off Pages)
Plus 4 cloudflare_workers_route resources binding the per-tenant
wildcards to the right Worker:
pg-customer-*.staging.* → instanode-pg-customers-staging
mongo-*.staging.* → instanode-mongodb-staging
redis-*.staging.* → instanode-redis-provision-staging
nats-*.staging.* → instanode-nats-staging
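
As a rough illustration of the hostname-to-Worker mapping above, a smoke-test helper might resolve which Worker a given staging hostname is expected to hit. The function name and the example hostnames are hypothetical. Shell `case` globbing only approximates Cloudflare's route-pattern matching, so treat this as a sketch of the intended mapping, not of the matching rules themselves.

```shell
# Hypothetical helper: prints the Worker expected to serve a staging hostname.
# Shell glob matching is only an approximation of Cloudflare route patterns.
worker_for_host() {
  case "$1" in
    pg-customer-*.staging.*) echo instanode-pg-customers-staging ;;
    mongo-*.staging.*)       echo instanode-mongodb-staging ;;
    redis-*.staging.*)       echo instanode-redis-provision-staging ;;
    nats-*.staging.*)        echo instanode-nats-staging ;;
    *)                       return 1 ;;
  esac
}
```

For example, `worker_for_host mongo-acme.staging.instanode.dev` prints `instanode-mongodb-staging`, where `acme` is a made-up tenant name.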
Plus a cloudflare_pages_domain attaching staging.instanode.dev to the
instanode-web-staging Pages project (the project itself is in pages.tf).
api.staging.instanode.dev is NOT in this file — wrangler claims it via
`custom_domain = true` in infra/wrangler/api/wrangler.toml. TF and
wrangler must not both manage the same DNS or wrangler deploy fails
with "DNS record already exists".
cache.tf — new file, implements D-12 (LOCKED) cache scope. Single
cloudflare_ruleset on http_request_cache_settings phase with 4 rules:
Rule 1: bypass cache for everything on api*.instanode.dev by default
Rule 2: cache /healthz at edge for 30s (SHA same across instances)
Rule 3: cache /openapi.json for 5min (frequent re-fetch by tooling)
Rule 4: cache /llms.txt for 1h (static, manual sync from content)
NO Authorization-header bypass (the original design — primitive doesn't
exist on our zone tier per D-12 and is a footgun anyway). Explicit
path allowlist is safer + simpler. The handler-side
`instant_unexpected_cached_response_total` P0 metric (added in api code,
NOT here) trips an alert if a request OUTSIDE the allowlist ever
responds with cache-hit semantics — defense in depth.
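
The path allowlist in the four rules above can be summarised as a lookup from request path to expected edge TTL, which a test script could use to decide which responses may legitimately be cache hits. The function name is hypothetical, and exact-path matching is an assumption; the ruleset's real match expressions are not shown in this commit message.

```shell
# Hypothetical helper: expected edge TTL in seconds for a path on the API host.
# 0 means "bypass cache" (Rule 1, the default). Exact-path matching is assumed.
edge_ttl_seconds() {
  case "$1" in
    /healthz)      echo 30 ;;    # Rule 2: 30s
    /openapi.json) echo 300 ;;   # Rule 3: 5min
    /llms.txt)     echo 3600 ;;  # Rule 4: 1h
    *)             echo 0 ;;     # Rule 1: bypass by default
  esac
}
```

Any path that maps to 0 here should never come back with cache-hit semantics, which is the condition the handler-side metric is described as alerting on.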
terraform fmt + validate both clean against cloudflare/cloudflare
v5.19.1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Stand up the staging.instanode.dev Cloudflare environment, per user direction 2026-05-30:
staging = CF-only, ephemeral state acceptable, manual interruption before any production promotion.
Three logical commits:
1. feat(terraform): bootstrap Cloudflare resources. 13 files under `terraform/cloudflare/`. CF provider v5.19.1, R2 state backend, two scoped account_tokens (deploy + admin/tunnel), R2 bucket with 24h-TTL on `anon/`, DNS records at TTL=60 per the cutover ramp ritual, Pages project for `instanode-web`. Workspaces split staging/production. Verified locally with `terraform fmt -check -recursive` + `terraform validate`.
2. feat(workflows): terraform plan-on-PR + per-env apply (manual-only). 3 GHA workflows:
   - `terraform.yml`: plan-on-PR for both envs (read-only, posts diff as PR comment)
   - `terraform-apply-staging.yml`: workflow_dispatch only, confirm phrase `APPLY-STAGING`, GH Environment `staging`
   - `terraform-apply-production.yml`: workflow_dispatch only, confirm phrase `APPLY-PRODUCTION`, requires numeric `staging_run_id` validated as digits-only AND `gh run view` confirms that staging run actually succeeded, GH Environment `production` gates on required reviewers. NO auto-promotion path from staging to production.
3. feat(wrangler): CF Containers scaffold for staging environment. 8 services under `wrangler/`, plus 2 GHA workflows. `cloudflare_container` is NOT in CF TF provider v5.19.1, so Containers are managed via wrangler; TF handles everything else (DNS/R2/Pages/Hyperdrive/Queues/tokens). pg-platform is a custom image that bakes the 63 platform migrations from the `api` repo into `/docker-entrypoint-initdb.d/`; CF Containers' ephemeral disk means migrations re-apply on every cold start, which is the explicit user-blessed tradeoff documented in `wrangler/README.md`.

D-N decisions implemented
D-1 / D-2 / D-3 / D-5 / D-7 / D-8 from `/tmp/cf-migration/shared/DECISIONS.md`.

Companion PRs
- `:staging` tag + notify-infra-on-migration workflow
- `:staging` tag
- `:staging` tag

Test plan
- `terraform-fmt-check` + `terraform-validate` jobs green
- `terraform-plan` job for both envs surfaces expected resource creation (no destroy)
- `validate.yml`) unaffected
- `wrangler r2 bucket create instanode-tf-state` out-of-band before any `terraform init`
- `terraform-apply-staging.yml` with confirm phrase, reviews resources created
- `wrangler-build-staging-images.yml` to bake the 4 custom images
- `wrangler-deploy-staging.yml service=all` to deploy 8 Containers
- `curl https://api.staging.instanode.dev/healthz` returns 200 with matching `commit_id`

Known not-done in this PR
🤖 Generated with Claude Code