Summary
The current `infra/terraform/main.tf` provisions ~30 resources in a single `terraform apply` — VNet, ACR, ACA environment, 8 ACA stub apps, Cosmos DB (+ private endpoint), APIM Consumption, AI Foundry (AVM module), SWA, and 16+ role assignments. On fresh environments this causes Azure control-plane saturation: all 8 ACA apps time out under a single correlation ID (see #118).
This is the parent epic for restructuring the deployment into 7 ordered phases:
1. Foundation — RG, VNet, ACR, ACA env, Storage, SWA stub, DNS
2. ACA Apps — 8× Container App stubs + AcrPull RBAC
3. Data + AI — Cosmos DB, Private Endpoint, AI Foundry (runs in parallel with Phase 2)
4. Gateway + RBAC — APIM Consumption, Cognitive Services RBAC, Agent RBAC
5. Image Deploy — Docker build → ACR push → `az containerapp update` (existing job, re-sequenced)
6. Seed Data — `scripts/seed_demo_data.py` + Foundry agent provisioning (new job)
7. Validate — Backend health + frontend→backend connectivity (existing job, enhanced)
Architectural Decision
Adopt stacked Terraform root modules under `infra/terraform/stacks/` with `terraform_remote_state` data sources between them. This extends the existing pattern already used for `stacks/services/` (APIM service-edge) and `stacks/frontend/`.
See ADR-013 for full analysis of alternatives (`-target`, `depends_on`, parallelism-only) and why stacked roots were chosen.
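The cross-stack wiring can be sketched as follows. This is a minimal illustration, not the repo's actual code: the backend variables, state key, and output name (`aca_environment_id`) are all assumptions.

```hcl
# stacks/aca-apps/data.tf (illustrative) — read outputs published by the
# foundation stack's remote state instead of re-declaring its resources.
data "terraform_remote_state" "foundation" {
  backend = "azurerm"
  config = {
    resource_group_name  = var.state_resource_group   # assumed variable names
    storage_account_name = var.state_storage_account
    container_name       = "tfstate"
    key                  = "foundation.tfstate"       # assumed state key
  }
}

locals {
  # Downstream resources reference the foundation stack's outputs here.
  aca_environment_id = data.terraform_remote_state.foundation.outputs.aca_environment_id
}
```

The foundation stack must declare a matching `output "aca_environment_id"` for this to resolve; each extracted stack follows the same read-outputs pattern rather than sharing a state file.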
Target Workflow DAG
```text
resolve-release → detect-changes
  ├─→ Phase 1: provision-foundation            [needs: detect-changes]
  │     ├─→ Phase 2a: provision-aca-apps       [needs: provision-foundation] ──┐
  │     └─→ Phase 2b: provision-data-ai        [needs: provision-foundation] ──┤ (parallel)
  │                                                                            │
  │     Phase 4: provision-gateway-rbac        [needs: aca-apps, data-ai] ◄────┘
  │     └─→ Phase 4b: provision-service-infra  (existing, per-service matrix)
  │
  ├─→ Phase 5: deploy-backend-services         [needs: aca-apps, service-infra, data-ai]
  │     └─→ Phase 5b: seed-data                [needs: deploy-backend-services, data-ai]
  │
  └─→ Phase 7: post-deploy-guardrails          [needs: deploy-backend-services, seed-data]
```
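The DAG above translates into GitHub Actions `needs:` edges roughly like this. Job IDs are illustrative; the real `azd-deploy.yml` names may differ.

```yaml
# Sketch of the target job graph (job names assumed, steps omitted).
jobs:
  provision-foundation:
    needs: [detect-changes]
  provision-aca-apps:
    needs: [provision-foundation]
  provision-data-ai:
    needs: [provision-foundation]   # runs in parallel with provision-aca-apps
  provision-gateway-rbac:
    needs: [provision-aca-apps, provision-data-ai]
  deploy-backend-services:
    needs: [provision-aca-apps, provision-service-infra, provision-data-ai]
  seed-data:
    needs: [deploy-backend-services, provision-data-ai]
  post-deploy-guardrails:
    needs: [deploy-backend-services, seed-data]
```

Because `needs:` gates on job success, a failure in either Phase 2 branch stops Phase 4 and everything downstream without cancelling the sibling branch.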
Sprint Plan
| Sprint | Issue | Scope | Risk |
|---|---|---|---|
| 1 | #[aca-apps] | Extract `stacks/aca-apps/` — 8 ACA apps + AcrPull RBAC | Medium |
| 2 | #[data-ai] | Extract `stacks/data-ai/` — Cosmos + Foundry | High (state surgery) |
| 3 | #[gateway-rbac] | Extract `stacks/gateway-rbac/` — APIM + cross-service RBAC | Low |
| 4 | #[workflow] | Restructure `azd-deploy.yml` job DAG + seed-data job | Medium |
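Sprint 2's "state surgery" means moving already-provisioned resources out of the monolithic state into the new stack's state without destroying them. One hedged sketch of that flow, for remote (azurerm) backends — the resource address and paths are illustrative, not the repo's actual ones:

```shell
# Pull both states to local files, move the address, then push back.
terraform -chdir=infra/terraform state pull > mono.tfstate
terraform -chdir=infra/terraform/stacks/data-ai state pull > data-ai.tfstate

# Move one resource between the two local state files (repeat per resource).
terraform state mv \
  -state=mono.tfstate -state-out=data-ai.tfstate \
  azurerm_cosmosdb_account.main azurerm_cosmosdb_account.main

terraform -chdir=infra/terraform state push mono.tfstate
terraform -chdir=infra/terraform/stacks/data-ai state push data-ai.tfstate

# Both roots should then plan clean (the moved config must move in the same PR).
terraform -chdir=infra/terraform plan
terraform -chdir=infra/terraform/stacks/data-ai plan
```

Take a backup of both state files before pushing, since `state push` overwrites the remote copy.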
Acceptance Criteria
- All stacks reach `Succeeded` provisioning state without `Operation expired` timeouts
- `terraform plan` on each stack shows 0 changes after the initial apply
- Existing stacks (`stacks/services/`) continue to work unchanged
- The decision is recorded as an ADR under `docs/adr/`
Dependencies
- `cancel-in-progress: false` on the workflow's concurrency group
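The `cancel-in-progress: false` dependency presumably refers to the workflow's concurrency guard, which keeps a newer push from cancelling a multi-phase deploy mid-`terraform apply`. A sketch (the group expression is an assumption):

```yaml
# Top-level workflow concurrency guard (illustrative group name).
concurrency:
  group: azd-deploy-${{ github.ref }}
  cancel-in-progress: false   # queue new runs; never kill an in-flight apply
```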