[Sprint 4] Restructure azd-deploy.yml into 7-phase DAG + seed-data job #123

@Cataldir

Summary

Restructure azd-deploy.yml from a single-provision-job workflow into a 7-phase DAG with dedicated jobs for each Terraform stack, a new seed-data job, and enhanced guardrails.

Parent epic: #119

Business Context

The current workflow runs a single provision job that applies all ~30 Terraform resources, then fans out service deploys through a matrix. The phased workflow splits infrastructure into jobs with a bounded blast radius, adds the database/agent seeding step that is currently missing, and validates end-to-end connectivity.

Scope

New jobs to add

| Job | Phase | Timeout | Dependencies |
| --- | --- | --- | --- |
| provision-aca-apps | 2a | 20 min | provision-foundation |
| provision-data-ai | 2b | 45 min | provision-foundation |
| provision-gateway-rbac | 4 | 20 min | provision-aca-apps, provision-data-ai |
| seed-data | 5b | 15 min | deploy-backend-services, provision-data-ai |

Jobs to modify

| Job | Change |
| --- | --- |
| provision → provision-foundation | Rename; strip ACA/Cosmos/APIM/Foundry resources from Terraform root; reduce timeout from 90 → 30 min; remove ACA state reconciliation steps; export foundation outputs |
| detect-changes | Add stack-aware routing outputs: stacks_aca_apps_changed, stacks_data_ai_changed, stacks_gateway_rbac_changed, seed_required |
| provision-service-infra | Update needs: from provision to [provision-aca-apps, provision-gateway-rbac] |
| deploy-backend-services | Update needs: to [provision-aca-apps, provision-service-infra, provision-data-ai]; switch env var sourcing from azd env get-values to upstream job outputs |
| post-deploy-guardrails | Add seed-data to needs:; add frontend→backend connectivity validation step |
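
As a sanity check on the dependency edges listed in the two tables above, the graph can be modelled in a few lines of Python. This is only a sketch: the `NEEDS` mapping and the `waves` helper are illustrative names, not part of the workflow, and post-deploy-guardrails is shown with seed-data as its only dependency because its existing needs: entries are not listed in this issue.

```python
from graphlib import TopologicalSorter

# Job -> jobs it waits on, copied from the "needs" described in this issue.
# Assumption: post-deploy-guardrails keeps other existing needs not shown here.
NEEDS = {
    "provision-foundation": [],
    "provision-aca-apps": ["provision-foundation"],
    "provision-data-ai": ["provision-foundation"],
    "provision-gateway-rbac": ["provision-aca-apps", "provision-data-ai"],
    "provision-service-infra": ["provision-aca-apps", "provision-gateway-rbac"],
    "deploy-backend-services": [
        "provision-aca-apps",
        "provision-service-infra",
        "provision-data-ai",
    ],
    "seed-data": ["deploy-backend-services", "provision-data-ai"],
    "post-deploy-guardrails": ["seed-data"],
}


def waves(needs):
    """Group jobs into waves; jobs in the same wave have no edge between them."""
    sorter = TopologicalSorter(needs)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


if __name__ == "__main__":
    for index, jobs in enumerate(waves(NEEDS), start=1):
        print(index, ", ".join(jobs))
```

With these edges, provision-aca-apps and provision-data-ai land in the same wave, and provision-gateway-rbac only becomes ready once both have finished, which is what the acceptance criteria expect.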

Concurrency changes (subsumes #116)

Replace cancel-in-progress: true with cancel-in-progress: false to prevent killing long-running provision phases. Add -lock-timeout=5m to all Terraform apply commands.
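
For illustration, a small Python helper that assembles a Terraform apply command with the lock timeout could look like the sketch below. The helper name `terraform_apply_argv` is hypothetical, and the `-chdir`, `-input=false`, and `-auto-approve` flags are assumptions about how the workflow invokes Terraform non-interactively; only `-lock-timeout=5m` comes from this issue.

```python
from typing import List

LOCK_TIMEOUT = "5m"  # value requested in this issue


def terraform_apply_argv(stack_dir: str, lock_timeout: str = LOCK_TIMEOUT) -> List[str]:
    """Build the argv for a non-interactive apply against one stack directory."""
    return [
        "terraform",
        f"-chdir={stack_dir}",  # assumption: each stack is applied from its own directory
        "apply",
        "-input=false",  # assumption: never prompt in CI
        "-auto-approve",  # assumption: CI applies without a manual approval step
        f"-lock-timeout={lock_timeout}",  # wait for a held state lock instead of failing at once
    ]


if __name__ == "__main__":
    print(" ".join(terraform_apply_argv("infra/terraform/stacks/aca-apps")))
```

The lock timeout pairs naturally with queued runs: a run that starts while another still holds the state lock waits up to five minutes before giving up.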

seed-data job specification

  • Runs on ubuntu-latest with Python 3.12
  • Derives APIM base URL from AZURE_ENV_NAME: https://tutor-{env}-apim.azure-api.net
  • Polls APIM health endpoint for readiness (Consumption cold-start can take 30–45s)
  • Runs python scripts/seed_demo_data.py --base-url "$APIM_BASE_URL"
  • Conditional: only on workflow_dispatch or when provision_required == true
  • Idempotent: script handles HTTP 409 as success
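
The bullets above can be sketched in Python roughly as follows. Everything here is illustrative rather than the contents of scripts/seed_demo_data.py: the function names, the retry counts, and the probe callable are assumptions, and the health endpoint URL is left to the caller because its path is not given in this issue.

```python
import time
import urllib.request
from typing import Callable


def apim_base_url(env_name: str) -> str:
    """Derive the APIM base URL from AZURE_ENV_NAME, following the pattern in this issue."""
    return f"https://tutor-{env_name}-apim.azure-api.net"


def http_probe(url: str, timeout_s: float = 10.0) -> bool:
    """Return True when the URL answers with a 2xx status, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 12,  # assumption: 12 tries, 5 s apart, outlasts a 30-45 s cold start
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


def is_seed_success(status: int) -> bool:
    """Treat 2xx as created and 409 as already present, so reruns stay idempotent."""
    return 200 <= status < 300 or status == 409


def should_run_seed(event_name: str, provision_required: bool) -> bool:
    """Mirror the stated condition: manual dispatch, or a run that provisions."""
    return event_name == "workflow_dispatch" or provision_required
```

Passing the probe and the sleep function in as arguments keeps the polling loop testable without network access; in the job itself the probe would wrap `http_probe` with whatever health URL the gateway exposes.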

detect-changes additions

New case entries for stack directories:

infra/terraform/stacks/aca-apps/*    → stacks_aca_apps_changed=true
infra/terraform/stacks/data-ai/*     → stacks_data_ai_changed=true
infra/terraform/stacks/gateway-rbac/* → stacks_gateway_rbac_changed=true
scripts/seed_demo_data.py            → seed_required=true

When provision_required == true, all stack flags cascade to true.
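
The routing and cascade rules can be expressed as a short Python sketch. The real detect-changes job uses shell case entries; this version uses `fnmatchcase`, whose `*` also matches across `/`. The names `ROUTES`, `STACK_FLAGS`, and `route_changes` are illustrative, and it assumes that seed_required is not one of the "stack flags" that cascade.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable

# Pattern -> output flag, copied from the mapping in this issue.
ROUTES = [
    ("infra/terraform/stacks/aca-apps/*", "stacks_aca_apps_changed"),
    ("infra/terraform/stacks/data-ai/*", "stacks_data_ai_changed"),
    ("infra/terraform/stacks/gateway-rbac/*", "stacks_gateway_rbac_changed"),
    ("scripts/seed_demo_data.py", "seed_required"),
]

# Assumption: only the three stacks_* flags cascade; seed_required does not.
STACK_FLAGS = (
    "stacks_aca_apps_changed",
    "stacks_data_ai_changed",
    "stacks_gateway_rbac_changed",
)


def route_changes(changed_files: Iterable[str], provision_required: bool = False) -> Dict[str, bool]:
    """Map changed file paths to detect-changes output flags."""
    flags = {flag: False for _, flag in ROUTES}
    for path in changed_files:
        for pattern, flag in ROUTES:
            if fnmatchcase(path, pattern):
                flags[flag] = True
    if provision_required:
        for flag in STACK_FLAGS:
            flags[flag] = True
    return flags
```

For example, a change that touches only `infra/terraform/stacks/data-ai/main.tf` (a hypothetical file name) sets stacks_data_ai_changed and leaves the other flags false unless provision_required is set.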

Timeout Budget

| Job | Current | Target |
| --- | --- | --- |
| provision-foundation | 90 min | 30 min |
| provision-aca-apps | N/A | 20 min |
| provision-data-ai | N/A | 45 min |
| provision-gateway-rbac | N/A | 20 min |
| provision-service-infra | unchanged | 15 min/svc |
| deploy-backend-services | unchanged | 30 min/svc |
| seed-data | N/A | 15 min |
| post-deploy-guardrails | unchanged | 10 min/svc |
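
To see what these targets imply for a full run, the sketch below computes the longest chain of timeouts through the dependency graph. It rests on assumptions not stated in this issue: matrix legs run in parallel (so a per-service timeout counts once), there is no runner queueing, and post-deploy-guardrails is modelled as depending only on seed-data. The names `NEEDS`, `TIMEOUT_MIN`, and `worst_case_minutes` are illustrative.

```python
from functools import lru_cache

# Dependency edges as described in this issue (guardrails shown with seed-data only).
NEEDS = {
    "provision-foundation": (),
    "provision-aca-apps": ("provision-foundation",),
    "provision-data-ai": ("provision-foundation",),
    "provision-gateway-rbac": ("provision-aca-apps", "provision-data-ai"),
    "provision-service-infra": ("provision-aca-apps", "provision-gateway-rbac"),
    "deploy-backend-services": (
        "provision-aca-apps",
        "provision-service-infra",
        "provision-data-ai",
    ),
    "seed-data": ("deploy-backend-services", "provision-data-ai"),
    "post-deploy-guardrails": ("seed-data",),
}

# Target timeouts in minutes from the budget table.
TIMEOUT_MIN = {
    "provision-foundation": 30,
    "provision-aca-apps": 20,
    "provision-data-ai": 45,
    "provision-gateway-rbac": 20,
    "provision-service-infra": 15,
    "deploy-backend-services": 30,
    "seed-data": 15,
    "post-deploy-guardrails": 10,
}


@lru_cache(maxsize=None)
def worst_case_minutes(job: str) -> int:
    """Timeout of this job plus the slowest chain of jobs it waits on."""
    upstream = max((worst_case_minutes(dep) for dep in NEEDS[job]), default=0)
    return TIMEOUT_MIN[job] + upstream


if __name__ == "__main__":
    print(worst_case_minutes("post-deploy-guardrails"))
```

Under those assumptions the upper bound for a run that hits every timeout is 165 minutes (30 + 45 + 20 + 15 + 30 + 15 + 10), with provision-data-ai, not provision-aca-apps, on the critical path.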

Acceptance Criteria

  • Full workflow_dispatch run completes successfully on fresh dev environment
  • Phase 2a (aca-apps) and Phase 2b (data-ai) run in parallel
  • Phase 4 (gateway-rbac) waits for both Phase 2 jobs
  • seed-data populates demo data via APIM
  • post-deploy-guardrails validate SWA→APIM→ACA connectivity
  • Rerun after success produces no infrastructure changes (idempotent)
  • cancel-in-progress is disabled; concurrent runs queue correctly
  • All existing matrix services deploy correctly (avatar, chat, configuration, essays, evaluation, lms-gateway, questions, upskilling)

Dependencies

Rollback

If any phase fails, downstream phases are blocked but completed phases remain stable. Terraform applies are idempotent, so rerunning the workflow continues from the last known state. No manual cleanup is needed.
