Summary
Restructure azd-deploy.yml from a single-provision-job workflow into a 7-phase DAG with dedicated jobs for each Terraform stack, a new seed-data job, and enhanced guardrails.
Parent epic: #119
Business Context
The current workflow runs a single provision job that applies all ~30 Terraform resources, then fans out service deploys through a job matrix. The phased workflow splits infrastructure into jobs with a bounded blast radius, adds the database/agent seeding step that is currently missing, and validates end-to-end connectivity.
Scope
New jobs to add
| Job | Phase | Timeout | Dependencies |
| --- | --- | --- | --- |
| `provision-aca-apps` | 2a | 20 min | `provision-foundation` |
| `provision-data-ai` | 2b | 45 min | `provision-foundation` |
| `provision-gateway-rbac` | 4 | 20 min | `provision-aca-apps`, `provision-data-ai` |
| `seed-data` | 5b | 15 min | `deploy-backend-services`, `provision-data-ai` |
Jobs to modify
| Job | Change |
| --- | --- |
| `provision` → `provision-foundation` | Rename; strip ACA/Cosmos/APIM/Foundry resources from Terraform root; reduce timeout from 90 → 30 min; remove ACA state reconciliation steps; export foundation outputs |
| `detect-changes` | Add stack-aware routing outputs: `stacks_aca_apps_changed`, `stacks_data_ai_changed`, `stacks_gateway_rbac_changed`, `seed_required` |
| `provision-service-infra` | Update `needs:` from `provision` to `[provision-aca-apps, provision-gateway-rbac]` |
| `deploy-backend-services` | Update `needs:` to `[provision-aca-apps, provision-service-infra, provision-data-ai]`; switch env var sourcing from `azd env get-values` to upstream job outputs |
| `post-deploy-guardrails` | Add `seed-data` to `needs:`; add frontend→backend connectivity validation step |
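The `needs:` edges listed in the two tables above can be sanity-checked as a graph before touching the workflow file. The following is a minimal sketch, not the workflow itself: it assumes `provision-foundation` has no upstream job and that `post-deploy-guardrails` depends only on `seed-data` (its other `needs:` entries are not listed in this issue). The `execution_waves` helper is a hypothetical name.

```python
from graphlib import TopologicalSorter

# Job -> jobs it needs. Edges come from the tables above; the root job and
# the guardrails entry are assumptions, as noted in the lead-in.
NEEDS = {
    "provision-foundation": [],
    "provision-aca-apps": ["provision-foundation"],
    "provision-data-ai": ["provision-foundation"],
    "provision-gateway-rbac": ["provision-aca-apps", "provision-data-ai"],
    "provision-service-infra": ["provision-aca-apps", "provision-gateway-rbac"],
    "deploy-backend-services": [
        "provision-aca-apps",
        "provision-service-infra",
        "provision-data-ai",
    ],
    "seed-data": ["deploy-backend-services", "provision-data-ai"],
    "post-deploy-guardrails": ["seed-data"],
}


def execution_waves(needs):
    """Group jobs into waves; every job in a wave has all its needs satisfied."""
    sorter = TopologicalSorter(needs)
    sorter.prepare()  # raises graphlib.CycleError if the needs graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(NEEDS), start=1):
        print(number, ", ".join(wave))
```

Under these assumptions the graph is acyclic, and only `provision-aca-apps` and `provision-data-ai` can run in parallel; every other job sits alone in its wave.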
Concurrency changes (subsumes #116)
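Replace `cancel-in-progress: true` with `cancel-in-progress: false` so that a newer run cannot kill a long-running provision phase mid-apply. Add `-lock-timeout=5m` to all Terraform apply commands so that a run waits up to five minutes for the state lock instead of failing immediately.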
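As an illustration of the lock-timeout change, here is a small sketch of how an apply command could be assembled per stack. It assumes the jobs invoke `terraform` directly from a stack directory; the helper name `terraform_apply_argv` and the use of `-chdir`, `-input=false`, and `-auto-approve` are assumptions, and only `-lock-timeout=5m` comes from this issue.

```python
import shlex


def terraform_apply_argv(stack_dir, lock_timeout="5m"):
    """Build the argv for a non-interactive apply that waits for the state lock."""
    return [
        "terraform",
        f"-chdir={stack_dir}",            # global option, must precede the subcommand
        "apply",
        "-input=false",                   # never prompt in CI
        "-auto-approve",                  # assumption: no manual approval gate
        f"-lock-timeout={lock_timeout}",  # retry the state lock for up to this long
    ]


if __name__ == "__main__":
    argv = terraform_apply_argv("infra/terraform/stacks/aca-apps")
    print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` so that a failed apply fails the job.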
seed-data job specification
- Runs on `ubuntu-latest` with Python 3.12
- Derives APIM base URL from `AZURE_ENV_NAME`: `https://tutor-{env}-apim.azure-api.net`
- Polls APIM health endpoint for readiness (Consumption cold-start can take 30–45s)
- Runs `python scripts/seed_demo_data.py --base-url "$APIM_BASE_URL"`
- Conditional: only on `workflow_dispatch` or when `provision_required == true`
- Idempotent: script handles HTTP 409 as success
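The URL derivation, readiness polling, and 409 handling described above could look roughly like the sketch below. It is not the contents of `scripts/seed_demo_data.py`. The function names, the 12 × 5s polling budget, and treating only HTTP 200 as "ready" are all assumptions. The health endpoint path is not given in this issue, so the probe is passed in as a callable.

```python
import time
import urllib.error
import urllib.request


def apim_base_url(env_name):
    """Derive the APIM base URL from AZURE_ENV_NAME, per the pattern above."""
    return f"https://tutor-{env_name}-apim.azure-api.net"


def http_status(url, timeout=10.0):
    """Return the HTTP status of a GET, or 0 if the request could not complete."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:  # 4xx/5xx still carry a status code
        return exc.code
    except (urllib.error.URLError, TimeoutError):
        return 0


def wait_until_ready(probe, attempts=12, delay=5.0, sleep=time.sleep):
    """Call probe() until it returns 200 or the attempts run out.

    12 attempts x 5s is an assumed budget, chosen to cover the 30-45s cold start.
    """
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


def seed_call_succeeded(status):
    """Treat any 2xx and HTTP 409 (record already exists) as success."""
    return 200 <= status < 300 or status == 409
```

A caller would wrap the real health URL, for example `wait_until_ready(lambda: http_status(health_url))`, where `health_url` is whatever endpoint the APIM instance exposes.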
detect-changes additions
New case entries for stack directories:
infra/terraform/stacks/aca-apps/* → stacks_aca_apps_changed=true
infra/terraform/stacks/data-ai/* → stacks_data_ai_changed=true
infra/terraform/stacks/gateway-rbac/* → stacks_gateway_rbac_changed=true
scripts/seed_demo_data.py → seed_required=true
When provision_required == true, all stack flags cascade to true.
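The routing and cascade rules above can be modelled in a few lines for testing outside the workflow. This is a sketch of the intended behaviour, not the shell `case` statement itself: `route_changes` is a hypothetical name, and it assumes the cascade applies to the three `stacks_*` flags only, leaving `seed_required` untouched. `fnmatchcase` is used because its `*` matches across `/`, as a shell `case` pattern does.

```python
from fnmatch import fnmatchcase

# (glob pattern, output flag) pairs taken from the list above.
ROUTES = [
    ("infra/terraform/stacks/aca-apps/*", "stacks_aca_apps_changed"),
    ("infra/terraform/stacks/data-ai/*", "stacks_data_ai_changed"),
    ("infra/terraform/stacks/gateway-rbac/*", "stacks_gateway_rbac_changed"),
    ("scripts/seed_demo_data.py", "seed_required"),
]
STACK_FLAGS = [flag for _, flag in ROUTES if flag.startswith("stacks_")]


def route_changes(changed_files, provision_required=False):
    """Map changed paths to boolean outputs, then apply the cascade rule."""
    outputs = {flag: False for _, flag in ROUTES}
    for path in changed_files:
        for pattern, flag in ROUTES:
            if fnmatchcase(path, pattern):
                outputs[flag] = True
    if provision_required:
        # Assumption: only the stack flags cascade, not seed_required.
        for flag in STACK_FLAGS:
            outputs[flag] = True
    return outputs


if __name__ == "__main__":
    print(route_changes(["infra/terraform/stacks/data-ai/main.tf"]))
```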
Timeout Budget
| Job | Current | Target |
| --- | --- | --- |
| `provision-foundation` | 90 min | 30 min |
| `provision-aca-apps` | N/A | 20 min |
| `provision-data-ai` | N/A | 45 min |
| `provision-gateway-rbac` | N/A | 20 min |
| `provision-service-infra` | unchanged | 15 min/svc |
| `deploy-backend-services` | unchanged | 30 min/svc |
| `seed-data` | N/A | 15 min |
| `post-deploy-guardrails` | unchanged | 10 min/svc |
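To see what the target timeouts imply for a whole run, the worst-case wall-clock bound is the longest path through the job graph, weighted by timeout. The sketch below works under stated assumptions: the per-service matrix legs run in parallel (so each `/svc` timeout counts once), `provision-foundation` is the root, and `post-deploy-guardrails` needs only `seed-data`. These figures are upper bounds from the table, not expected durations.

```python
from graphlib import TopologicalSorter

# Target timeouts in minutes, from the table above.
TIMEOUT_MIN = {
    "provision-foundation": 30,
    "provision-aca-apps": 20,
    "provision-data-ai": 45,
    "provision-gateway-rbac": 20,
    "provision-service-infra": 15,
    "deploy-backend-services": 30,
    "seed-data": 15,
    "post-deploy-guardrails": 10,
}

# Job -> jobs it needs (the guardrails entry is an assumption, see lead-in).
NEEDS = {
    "provision-foundation": [],
    "provision-aca-apps": ["provision-foundation"],
    "provision-data-ai": ["provision-foundation"],
    "provision-gateway-rbac": ["provision-aca-apps", "provision-data-ai"],
    "provision-service-infra": ["provision-aca-apps", "provision-gateway-rbac"],
    "deploy-backend-services": [
        "provision-aca-apps",
        "provision-service-infra",
        "provision-data-ai",
    ],
    "seed-data": ["deploy-backend-services", "provision-data-ai"],
    "post-deploy-guardrails": ["seed-data"],
}


def worst_case_minutes(needs, timeouts):
    """Longest timeout-weighted path: latest possible finish time of any job."""
    finish = {}
    for job in TopologicalSorter(needs).static_order():
        start = max((finish[dep] for dep in needs[job]), default=0)
        finish[job] = start + timeouts[job]
    return max(finish.values())


if __name__ == "__main__":
    print(worst_case_minutes(NEEDS, TIMEOUT_MIN))
```

With the numbers above, the bound is 165 minutes, driven by the path through `provision-data-ai` and `provision-gateway-rbac`.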
Acceptance Criteria
- `cancel-in-progress` is disabled; concurrent runs queue correctly
Dependencies
Rollback
If any phase fails, downstream phases are blocked but completed phases remain stable. Terraform is idempotent, so rerunning the workflow continues from the last known state. No manual cleanup needed.