[Epic] Phased deployment: split Terraform monolith into stacked root modules (ADR-013) #119

@Cataldir

Description

Summary

The current infra/terraform/main.tf provisions ~30 resources in a single terraform apply — VNet, ACR, ACA environment, 8 ACA stub apps, Cosmos DB (+ private endpoint), APIM Consumption, AI Foundry (AVM module), SWA, and 16+ role assignments. On fresh environments this causes Azure control-plane saturation: all 8 ACA apps time out under a single correlation ID (see #118).

This is the parent epic for restructuring the deployment into 7 ordered phases:

  1. Foundation — RG, VNet, ACR, ACA env, Storage, SWA stub, DNS
  2. ACA Apps — 8× Container App stubs + AcrPull RBAC
  3. Data + AI — Cosmos DB, Private Endpoint, AI Foundry (parallel with Phase 2)
  4. Gateway + RBAC — APIM Consumption, Cognitive Services RBAC, Agent RBAC
  5. Image Deploy — Docker build → ACR push → az containerapp update (existing job, re-sequenced)
  6. Seed Data — scripts/seed_demo_data.py + Foundry agent provisioning (new job)
  7. Validate — Backend health + frontend→backend connectivity (existing job, enhanced)
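
Each provisioning phase above would map to its own root module with an isolated state file. A minimal sketch of what the Phase 2 stack's backend block could look like, assuming an azurerm remote backend (resource names and state key are illustrative, not final):

```hcl
# stacks/aca-apps/backend.tf — hypothetical layout; names are illustrative
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"
    storage_account_name = "sttfstate"
    container_name       = "tfstate"
    key                  = "aca-apps.tfstate" # one state file per stack
  }
}
```

Keeping one state key per stack is what lets each phase plan/apply independently and keeps blast radius small during the state surgery in Sprint 2.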

Architectural Decision

Adopt stacked Terraform root modules under infra/terraform/stacks/ with terraform_remote_state data sources between them. This extends the existing pattern already used for stacks/services/ (APIM service-edge) and stacks/frontend/.
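
The cross-stack wiring would look roughly like this — a sketch assuming the foundation stack exports aca_environment_id and resource_group_name as outputs (actual output names and backend config TBD):

```hcl
# stacks/aca-apps/data.tf — consume Phase 1 outputs via remote state (sketch)
data "terraform_remote_state" "foundation" {
  backend = "azurerm"
  config = {
    resource_group_name  = "rg-tfstate"
    storage_account_name = "sttfstate"
    container_name       = "tfstate"
    key                  = "foundation.tfstate"
  }
}

# Example stub app wired to the foundation stack's ACA environment
resource "azurerm_container_app" "stub" {
  name                         = "ca-example-stub"
  container_app_environment_id = data.terraform_remote_state.foundation.outputs.aca_environment_id
  resource_group_name          = data.terraform_remote_state.foundation.outputs.resource_group_name
  revision_mode                = "Single"

  template {
    container {
      name   = "stub"
      image  = "mcr.microsoft.com/azuredocs/containerapps-helloworld:latest"
      cpu    = 0.25
      memory = "0.5Gi"
    }
  }
}
```

This mirrors how stacks/services/ and stacks/frontend/ already consume upstream outputs, so no new pattern is introduced.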

See ADR-013 for full analysis of alternatives (-target, depends_on, parallelism-only) and why stacked roots were chosen.

Target Workflow DAG

resolve-release → detect-changes
  ├─→ Phase 1: provision-foundation          [needs: detect-changes]
  │       ├─→ Phase 2a: provision-aca-apps    [needs: provision-foundation]  ──┐
  │       └─→ Phase 2b: provision-data-ai     [needs: provision-foundation]  ──┤ (parallel)
  │                                                                            │
  │           Phase 4: provision-gateway-rbac  [needs: aca-apps, data-ai]  ◄───┘
  │               └─→ Phase 4b: provision-service-infra  (existing, per-service matrix)
  │
  ├─→ Phase 5: deploy-backend-services  [needs: aca-apps, service-infra, data-ai]
  │       └─→ Phase 5b: seed-data  [needs: deploy-backend-services, data-ai]
  │
  └─→ Phase 7: post-deploy-guardrails  [needs: deploy-backend-services, seed-data]

Sprint Plan

Sprint  Issue            Scope                                                      Risk
1       #[aca-apps]      Extract stacks/aca-apps/ — 8 ACA apps + AcrPull RBAC       Medium
2       #[data-ai]       Extract stacks/data-ai/ — Cosmos + Foundry                 High (state surgery)
3       #[gateway-rbac]  Extract stacks/gateway-rbac/ — APIM + cross-service RBAC   Low
4       #[workflow]      Restructure azd-deploy.yml job DAG + seed-data job         Medium

Acceptance Criteria

  • Fresh dev environment provisions successfully in < 40 minutes
  • All 8 ACA apps reach the Succeeded provisioning state without Operation expired errors
  • terraform plan on each stack shows 0 changes after initial apply
  • Existing service-edge stacks (stacks/services/) continue to work unchanged
  • Seed data runs and populates demo records via APIM
  • Post-deploy guardrails pass including frontend→backend connectivity
  • Workflow is idempotent — rerun after success produces no changes
  • ADR-013 documented in docs/adr/
