feat(ci): per-intent + daytime-lifecycle UAT cluster acquisition #1586
Implements DC2 (NVIDIA#1275): acquire a UAT cluster shaped by {reservation, intent} through the reservation lease, and add the two cluster lifecycles a reservation can carry (nightly per-run vs long-lived daytime) with an enforced cleanup boundary between them.

Per-intent selection: add an intent input (training|inference) to uat-run.yaml, forwarded to the cloud pipelines, selecting tests/uat/<cloud>/tests/h100-<intent>-config.yaml and the evidence-ingest recipe name. Add the inference sibling configs (criteria intent=inference/platform=dynamo). Both intents drive the same cluster-config (GPU pool from the reservation, system/CPU pools dynamic); the training TrainJob CUJ is gated to intent=training since served inference (phase_serve) is DC3's.

AWS capacity: convert the race-and-fail pre-flight into a post-lease assertion on TotalInstanceCount (the reservation's fixed size) rather than AvailableInstanceCount, so contention no longer fails (the lease is the gate) but a genuinely undersized reservation still does.

GCP capacity: record the decision to rely on the GKE actuator failing at provision time (no symmetric pre-flight check), documented in docs/contributor/uat.md and noted in uat-gcp.yaml.

Daytime lifecycle: add a lifecycle input (nightly|daytime-up|daytime-down). daytime-* use a stable, reservation-tagged deployment.id so the evening teardown and the pre-batch guard can find the held cluster across runs (state is remote, keyed by deployment.id). daytime-up provisions, deploys, and holds; daytime-down tears the held cluster down.

Pre-batch guard: every nightly run refuses to provision into a reservation that still holds an un-torn-down daytime cluster (detected by the stable cluster name), failing fast and fail-closed rather than racing the reservation.

Signed-off-by: Nathan Hensley <nhensley@nvidia.com>
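The stable deployment.id and the fail-closed pre-batch guard described above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual code: the `uat-daytime-<reservation>` naming scheme, the function names, and the probe contract (exit 0 = cluster exists, 1 = confirmed absent, anything else = lookup error) are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: the ID format and the probe contract are hypothetical.

# deployment_id LIFECYCLE RESERVATION [RUN_ID]
# Daytime lifecycles share one stable, reservation-tagged ID so that a later
# run can find the held cluster; nightly runs get a per-run ID.
deployment_id() {
  local lifecycle="$1" reservation="$2" run_id="${3:-}"
  case "$lifecycle" in
    daytime-up|daytime-down) printf 'uat-daytime-%s\n' "$reservation" ;;
    nightly)                 printf 'uat-%s-%s\n' "$reservation" "$run_id" ;;
    *) printf 'unknown lifecycle: %s\n' "$lifecycle" >&2; return 2 ;;
  esac
}

# pre_batch_guard RESERVATION PROBE [PROBE_ARGS...]
# PROBE is a caller-supplied command that receives the stable cluster name as
# its last argument. The guard passes (returns 0) only when the probe confirms
# the cluster is absent; a present cluster or a failed lookup both block the
# nightly run, which is what makes the guard fail-closed.
pre_batch_guard() {
  local reservation="$1"; shift
  local name status
  name="$(deployment_id daytime-up "$reservation")" || return 1
  if "$@" "$name"; then status=0; else status=$?; fi
  case "$status" in
    1) return 0 ;;
    0) printf 'daytime cluster %s is still held; refusing to provision\n' "$name" >&2
       return 1 ;;
    *) printf 'cannot determine state of %s; refusing to provision\n' "$name" >&2
       return 1 ;;
  esac
}
```

A nightly step might call `pre_batch_guard "$RESERVATION" my_cluster_probe || exit 1`, where `my_cluster_probe` stands in for whatever cloud-specific lookup the pipeline uses. Treating lookup errors the same as "cluster present" trades an occasional spurious failure for never racing a held reservation.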
🌿 Preview your docs: https://nvidia-preview-ci-dc2-per-intent-acquisition.docs.buildwithfern.com/aicr
📝 Walkthrough
Estimated code review effort: 4 (Complex) | ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/uat-gcp.yaml (1)
162-189: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Keep daytime-down independent of build/image setup.
Line 165/178/189 skip build and image work, but daytime-down still runs unrelated setup/auth steps before teardown. A failure in Go setup, GHCR login, initial GCP auth/setup-gcloud, or unnecessary tool installs can block the cleanup path.
Suggested tightening
 - name: Setup Go
-  if: inputs.aicr_version == '' && inputs.skip_tests != true
+  if: inputs.aicr_version == '' && inputs.skip_tests != true && inputs.lifecycle != 'daytime-down'
 - name: Authenticate to GHCR
+  if: inputs.lifecycle != 'daytime-down'
   uses: ./.github/actions/ghcr-login
 - name: Authenticate to GCP
   id: auth
+  if: inputs.lifecycle != 'daytime-down'
   uses: google-github-actions/auth@7c6bc770dae815cd3e89ee6cdf493a5fab2cc093 # v3.0.0
 - name: Setup gcloud
+  if: inputs.lifecycle != 'daytime-down'
   uses: google-github-actions/setup-gcloud@aa5489c8933f4cc7a4f7d45035b3b1440c9c10db # v3.0.1
-    install_kubectl: 'true'
+    install_kubectl: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
     install_yq: 'true'
-    install_helm: 'true'
+    install_helm: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
-    install_helmfile: 'true'
+    install_helmfile: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
-    install_ko: ${{ inputs.aicr_version == '' && inputs.skip_tests != true && 'true' || 'false' }}
+    install_ko: ${{ inputs.aicr_version == '' && inputs.skip_tests != true && inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/uat-gcp.yaml around lines 162 - 189, Keep the `daytime-down` path isolated from non-teardown setup: update the workflow around the `Build aicr binary`, `Install released aicr`, `Authenticate to GHCR`, and other bootstrap steps so they are also skipped when `inputs.lifecycle == 'daytime-down'`. Make sure the teardown job only runs the cleanup/auth needed for `daytime-down`, and avoid any Go build, release install, GHCR login, or image-related preparation in that path.
.github/workflows/uat-aws.yaml (1)
164-191: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Keep daytime-down independent of build/image setup.
Line 167/180/191 skip the heavy build/push steps, but a daytime-down run still executes unrelated setup such as Go setup, GHCR login, initial AWS auth, and ko installation before teardown. Any failure there can prevent config update/tool setup and turn an explicit cleanup run into a leaked cluster.
Suggested tightening
 - name: Setup Go
-  if: inputs.aicr_version == '' && inputs.skip_tests != true
+  if: inputs.aicr_version == '' && inputs.skip_tests != true && inputs.lifecycle != 'daytime-down'
 - name: Authenticate to GHCR
+  if: inputs.lifecycle != 'daytime-down'
   uses: ./.github/actions/ghcr-login
 - name: Configure AWS credentials
   id: auth
+  if: inputs.lifecycle != 'daytime-down'
   uses: aws-actions/configure-aws-credentials@254c19bd240aabef8777f48595e9d2d7b972184b # v6.2.1
-    install_kubectl: 'true'
+    install_kubectl: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
     install_yq: 'true'
-    install_helm: 'true'
+    install_helm: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
-    install_helmfile: 'true'
+    install_helmfile: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
-    install_ko: ${{ inputs.aicr_version == '' && inputs.skip_tests != true && 'true' || 'false' }}
+    install_ko: ${{ inputs.aicr_version == '' && inputs.skip_tests != true && inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/uat-aws.yaml around lines 164 - 191, The daytime-down path is still entering unrelated setup before teardown, which can block cleanup if those steps fail. Update the workflow conditions around the setup sequence in uat-aws.yaml so the daytime-down lifecycle skips nonessential prep like Go setup, GHCR login, AWS auth, and ko installation, not just the build/push jobs. Use the existing lifecycle checks on the same steps around Build aicr binary, Install released aicr, Authenticate to GHCR, and the AWS/ko setup blocks to keep daytime-down independent and teardown-only.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: fc868d63-8b16-4958-a737-41ef8f14c930
📒 Files selected for processing (6)
.github/workflows/uat-aws.yaml
.github/workflows/uat-gcp.yaml
.github/workflows/uat-run.yaml
docs/contributor/uat.md
tests/uat/aws/tests/h100-inference-config.yaml
tests/uat/gcp/tests/h100-inference-config.yaml
    SUMMARY_INTENT: ${{ inputs.intent }}
    SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
  run: |
    {
      echo "## UAT Results (AWS)"
      echo ""
      printf '**Reservation:** `%s`\n' "$SUMMARY_RESERVATION"
      printf '**Reservation:** `%s` · **Intent:** `%s` · **Lifecycle:** `%s`\n' \
        "$SUMMARY_RESERVATION" "$SUMMARY_INTENT" "$SUMMARY_LIFECYCLE"
      printf '**Cluster:** `%s`\n' "$DEPLOYMENT_ID"
      printf '**AICR version:** `%s`\n' "${SUMMARY_AICR_VERSION:-main (build from source)}"
      echo "**Build:** \`${{ github.sha }}\` (branch: \`${{ github.ref_name }}\`)"
      echo ""
      echo "| Phase | Status |"
      echo "|-------|--------|"
      echo "| Pre-batch guard | ${{ steps.guard.outcome }} |"
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Avoid shell-expanding github.ref_name in the summary step.
Line 526 injects github.ref_name directly into a double-quoted shell command. Branch/ref names can contain shell metacharacters, so pass it through env and print it as data instead.
Suggested fix
   env:
     SUMMARY_RESERVATION: ${{ inputs.reservation }}
     SUMMARY_AICR_VERSION: ${{ inputs.aicr_version }}
     SUMMARY_INTENT: ${{ inputs.intent }}
     SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
+    SUMMARY_SHA: ${{ github.sha }}
+    SUMMARY_REF_NAME: ${{ github.ref_name }}
   run: |
     {
       echo "## UAT Results (AWS)"
@@
-      echo "**Build:** \`${{ github.sha }}\` (branch: \`${{ github.ref_name }}\`)"
+      printf '**Build:** `%s` (branch: `%s`)\n' "$SUMMARY_SHA" "$SUMMARY_REF_NAME"
📝 Committable suggestion
    SUMMARY_INTENT: ${{ inputs.intent }}
    SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
    SUMMARY_SHA: ${{ github.sha }}
    SUMMARY_REF_NAME: ${{ github.ref_name }}
  run: |
    {
      echo "## UAT Results (AWS)"
      echo ""
      printf '**Reservation:** `%s` · **Intent:** `%s` · **Lifecycle:** `%s`\n' \
        "$SUMMARY_RESERVATION" "$SUMMARY_INTENT" "$SUMMARY_LIFECYCLE"
      printf '**Cluster:** `%s`\n' "$DEPLOYMENT_ID"
      printf '**AICR version:** `%s`\n' "${SUMMARY_AICR_VERSION:-main (build from source)}"
      printf '**Build:** `%s` (branch: `%s`)\n' "$SUMMARY_SHA" "$SUMMARY_REF_NAME"
      echo ""
      echo "| Phase | Status |"
      echo "|-------|--------|"
      echo "| Pre-batch guard | ${{ steps.guard.outcome }} |"
🧰 Tools
🪛 zizmor (1.26.1)
[warning] 526-526: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
[error] 526-526: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
[info] 530-530: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/uat-aws.yaml around lines 516 - 530, The summary step in
the workflow is embedding github.ref_name directly inside a double-quoted shell
echo/printf command, which can allow shell metacharacters in the ref name to be
interpreted. Move the ref name into an env variable for this step and print that
variable as data in the summary block, using the existing run section that
writes the UAT Results and references github.sha/github.ref_name.
Source: Linters/SAST tools
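The reasoning behind this finding can be shown with a small standalone sketch. A `${{ ... }}` expression is substituted into the script text before the shell parses it, so a ref name containing `$(...)` or backticks becomes shell code; a value delivered through `env` is only ever expanded as data. The snippet below reuses the `SUMMARY_SHA` and `SUMMARY_REF_NAME` names from the suggested fix; the function name and the sample values are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: render the summary "Build" line from environment variables only.

render_build_line() {
  # The format string is a fixed, single-quoted literal. The values are passed
  # as printf arguments, so the shell expands the variables once and never
  # re-parses their contents as code.
  printf '**Build:** `%s` (branch: `%s`)\n' "$SUMMARY_SHA" "$SUMMARY_REF_NAME"
}

# Hypothetical hostile-looking ref name containing a command substitution.
SUMMARY_SHA='0123abc'
SUMMARY_REF_NAME='feature/$(id)'
render_build_line
```

Run as written, this prints the ref name verbatim, `$(id)` included, without executing it. Had the same string been pasted into a double-quoted `echo` by template expansion, the shell would have run `id` while building the line.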
    SUMMARY_INTENT: ${{ inputs.intent }}
    SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
  run: |
    {
      echo "## UAT Results (GCP)"
      echo ""
      printf '**Reservation:** `%s`\n' "$SUMMARY_RESERVATION"
      printf '**Reservation:** `%s` · **Intent:** `%s` · **Lifecycle:** `%s`\n' \
        "$SUMMARY_RESERVATION" "$SUMMARY_INTENT" "$SUMMARY_LIFECYCLE"
      printf '**Cluster:** `%s`\n' "$DEPLOYMENT_ID"
      printf '**AICR version:** `%s`\n' "${SUMMARY_AICR_VERSION:-main (build from source)}"
      echo "**Build:** \`${{ github.sha }}\` (branch: \`${{ github.ref_name }}\`)"
      echo ""
      echo "| Phase | Status |"
      echo "|-------|--------|"
      echo "| Pre-batch guard | ${{ steps.guard.outcome }} |"
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Avoid shell-expanding github.ref_name in the summary step.
Line 487 injects github.ref_name directly into a double-quoted shell command. Pass it via env and print with printf so the ref name is treated as data.
Suggested fix
   env:
     SUMMARY_RESERVATION: ${{ inputs.reservation }}
     SUMMARY_AICR_VERSION: ${{ inputs.aicr_version }}
     SUMMARY_INTENT: ${{ inputs.intent }}
     SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
+    SUMMARY_SHA: ${{ github.sha }}
+    SUMMARY_REF_NAME: ${{ github.ref_name }}
   run: |
     {
       echo "## UAT Results (GCP)"
@@
-      echo "**Build:** \`${{ github.sha }}\` (branch: \`${{ github.ref_name }}\`)"
+      printf '**Build:** `%s` (branch: `%s`)\n' "$SUMMARY_SHA" "$SUMMARY_REF_NAME"
📝 Committable suggestion
    SUMMARY_INTENT: ${{ inputs.intent }}
    SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
    SUMMARY_SHA: ${{ github.sha }}
    SUMMARY_REF_NAME: ${{ github.ref_name }}
  run: |
    {
      echo "## UAT Results (GCP)"
      echo ""
      printf '**Reservation:** `%s` · **Intent:** `%s` · **Lifecycle:** `%s`\n' \
        "$SUMMARY_RESERVATION" "$SUMMARY_INTENT" "$SUMMARY_LIFECYCLE"
      printf '**Cluster:** `%s`\n' "$DEPLOYMENT_ID"
      printf '**AICR version:** `%s`\n' "${SUMMARY_AICR_VERSION:-main (build from source)}"
      printf '**Build:** `%s` (branch: `%s`)\n' "$SUMMARY_SHA" "$SUMMARY_REF_NAME"
      echo ""
      echo "| Phase | Status |"
      echo "|-------|--------|"
      echo "| Pre-batch guard | ${{ steps.guard.outcome }} |"
🧰 Tools
🪛 zizmor (1.26.1)
[warning] 487-487: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
[error] 487-487: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
[info] 491-491: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/uat-gcp.yaml around lines 477 - 491, The UAT summary step
is interpolating github.ref_name directly inside the shell block, which should
be treated as data instead of inline shell text. Update the summary step in the
workflow job that prints “UAT Results (GCP)” to pass github.ref_name through env
alongside the other SUMMARY_* values, then use printf (or equivalent) to render
the branch name in the build line so the ref name is not shell-expanded.
Source: Linters/SAST tools
Implements DC8 (NVIDIA#1281): stand up one long-lived, human-facing deployment per cloud for the working day, then tear it down before the nightly batch. DC2 (NVIDIA#1586) already owns the daytime-up/daytime-down lifecycle, provision-and-hold, teardown, and pre-batch guard; DC8 adds the orchestration on top: the cloud→flavor data mapping, the scheduler, and the out-of-band access path.

Cloud→flavor split (data, not code): add an optional daytime-intent column to the reservation registry (aws-h100=training, gcp-h100=inference at launch; empty = nightly-batch only). pkg/uatbroker gains the DaytimeIntent field, intent constants + validation, and DaytimeAssignments(); uat-broker gains a `reservations --daytime` JSON matrix output. Only one daytime reservation per cloud is allowed (a reservation cannot hold both a held daytime cluster and the nightly batch at once), enforced by the committed-registry test.

Scheduler (uat-daytime.yaml): a thin scheduler over the daytime-up/daytime-down mechanics. A morning cron dispatches daytime-up per rotation reservation, an evening cron (before the 04:00 batch) dispatches daytime-down; a manual workflow_dispatch(action=up|down) covers ad-hoc runs. Each dispatch routes through uat-run.yaml so it takes the same per-reservation lease as the batch, and is watched to completion so a failed handoff/teardown surfaces. The daytime cluster is not a UAT cell: daytime-up stops after deploy, emitting no evidence bundle and no TestGrid column.

Docs (docs/contributor/uat.md): document the cloud→flavor split, the scheduler and its cron edges, missed-teardown recovery via DC2's pre-batch guard, and the out-of-band access path: stable cluster names gated by cloud IAM so no credential transits CI, with TrainJob submission (AWS) and the Dynamo OpenAI endpoint port-forward (GCP). The served DynamoGraphDeployment remains DC3's phase_serve; until it lands the served workload is a documented manual apply.
Signed-off-by: Nathan Hensley <nhensley@nvidia.com>
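The "one daytime reservation per cloud" rule from the commit message above can be illustrated with a short shell sketch. The real check lives in the Go registry test and is not shown on this page; the whitespace-separated row layout (`reservation cloud [daytime-intent]`), the function name, and the sample reservation names in the usage note are assumptions made only for this example.

```shell
#!/usr/bin/env bash
# Sketch only: the registry row format here is hypothetical.

# check_daytime_registry reads rows "reservation cloud [daytime-intent]" on
# stdin. Rows without a third column are nightly-batch only and are ignored.
# Exits non-zero if an intent is not training|inference, or if any cloud has
# more than one reservation with a daytime intent.
check_daytime_registry() {
  awk '
    NF >= 3 {
      if ($3 != "training" && $3 != "inference") {
        printf "invalid daytime-intent %s for %s\n", $3, $1 > "/dev/stderr"
        bad = 1
      }
      if (++seen[$2] > 1) {
        printf "cloud %s has more than one daytime reservation\n", $2 > "/dev/stderr"
        bad = 1
      }
    }
    END { exit bad }
  '
}
```

For example, `printf 'aws-h100 aws training\ngcp-h100 gcp inference\n' | check_daytime_registry` passes, while adding a second AWS row with a daytime intent makes it fail.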
Summary
Acquire a UAT cluster shaped by {reservation, intent} through the reservation lease, and add the two cluster lifecycles a reservation can carry (the nightly per-run cluster and the long-lived daytime deployment) with an enforced cleanup boundary between them.
Motivation / Context
Implements DC2. AWS previously hard-failed on a busy reservation in a race-and-fail pre-flight; the test config and recipe intent were hard-coded to training; and there was no mechanism for the morning-handoff daytime cluster or a guard preventing the nightly batch from racing an un-torn-down daytime deployment. This builds on DC1's reservation lease and dispatch surface.
Fixes: #1275
Related: #1264, #1274
Type of Change
Component(s) Affected
- docs/, examples/
- .github/workflows/uat-*.yaml and UAT test configs (tests/uat/**)
Implementation Notes
- Per-intent selection: an intent input (training|inference) on uat-run.yaml, forwarded to the cloud pipelines, selects tests/uat/<cloud>/tests/h100-<intent>-config.yaml and the evidence-ingest recipe name. New inference sibling configs (intent: inference / platform: dynamo). Both intents drive the same cluster-config.yaml (GPU pool from the reservation, system/CPU pools dynamic). The training TrainJob CUJ is gated to intent=training; served inference (phase_serve) is DC3's.
- AWS capacity: the race-and-fail pre-flight becomes a post-lease assertion on TotalInstanceCount >= desired (the reservation's fixed size) rather than AvailableInstanceCount, so contention no longer hard-fails (the lease is the gate) but a genuinely undersized reservation still does.
- GCP capacity: rely on the GKE actuator failing at provision time (no symmetric pre-flight check); the decision is documented in docs/contributor/uat.md and noted in uat-gcp.yaml (no phantom step).
- Daytime lifecycle: a lifecycle input (nightly|daytime-up|daytime-down). Daytime modes use a stable, reservation-tagged deployment.id so teardown and the guard find the held cluster across runs (Terraform state is remote, keyed by deployment.id). daytime-up provisions + deploys + holds; daytime-down tears the held cluster down.
- Pre-batch guard: every nightly run refuses to provision into a reservation that still holds an un-torn-down daytime cluster (detected by the stable cluster name), failing fast and fail-closed rather than racing.
Testing
Both inference configs resolve (recipe + bundle produce a valid bundle/helmfile.yaml); validate --no-cluster only needs a runtime snapshot. yamllint, actionlint (pinned v1.7.11, matching the merge gate's -shellcheck=), and the docs filename/MDX checks pass. No Go changes. Real UAT (provision on live GPU hardware) is not runnable locally and is exercised by dispatching the workflow.
Risk Assessment
Rollout notes: intent and lifecycle default to training/nightly, so the nightly cron and existing dispatches are unchanged. The daytime lifecycle's workload content and access distribution are DC8's and layer on top of the daytime-up mechanic shipped here.
Checklist
- make test with -race: N/A, no Go changes
- make lint: yamllint + actionlint + docs checks pass
- docs/contributor/uat.md
- git commit -S
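Two of the mechanics in the implementation notes, per-intent config selection and the post-lease capacity assertion, can be sketched in a few lines of shell. This is an illustrative sketch under stated assumptions, not the workflow's code: the function names are hypothetical, and the AWS CLI call mentioned in the comment is one plausible way to obtain the count rather than something taken from the PR.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# uat_config_path CLOUD [INTENT]
# Maps {cloud, intent} to the test config path named in the PR description.
# INTENT defaults to training, matching the documented input default.
uat_config_path() {
  local cloud="$1" intent="${2:-training}"
  case "$intent" in
    training|inference)
      printf 'tests/uat/%s/tests/h100-%s-config.yaml\n' "$cloud" "$intent" ;;
    *)
      printf 'unsupported intent: %s\n' "$intent" >&2; return 2 ;;
  esac
}

# assert_reservation_size TOTAL DESIRED
# Post-lease check against the reservation's fixed size. It compares TOTAL
# (not the currently available count), so a busy reservation passes while an
# undersized one fails. TOTAL could, as an assumption, be read with:
#   aws ec2 describe-capacity-reservations --capacity-reservation-ids "$id" \
#     --query 'CapacityReservations[0].TotalInstanceCount' --output text
assert_reservation_size() {
  local total="$1" desired="$2"
  if [ "$total" -ge "$desired" ]; then
    return 0
  fi
  printf 'reservation too small: TotalInstanceCount=%s < desired=%s\n' \
    "$total" "$desired" >&2
  return 1
}
```

Because the lease already serializes access to the reservation, checking the total rather than the available count keeps contention from being reported as a capacity failure.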