feat(ci): per-intent + daytime-lifecycle UAT cluster acquisition #1586

Merged: njhensley merged 1 commit into NVIDIA:main from njhensley:ci/dc2-per-intent-acquisition on Jul 2, 2026

@njhensley

Member

Summary

Acquire a UAT cluster shaped by {reservation, intent} through the reservation lease, and add the two cluster lifecycles a reservation can carry — the nightly per-run cluster and the long-lived daytime deployment — with an enforced cleanup boundary between them.

Motivation / Context

Implements DC2. AWS previously hard-failed on a busy reservation in a race-and-fail pre-flight; the test config and recipe intent were hard-coded to training; and there was no mechanism for the morning-handoff daytime cluster or a guard preventing the nightly batch from racing an un-torn-down daytime deployment. This builds on DC1's reservation lease and dispatch surface.

Fixes: #1275
Related: #1264, #1274

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Documentation update
  • Build/CI/tooling

Component(s) Affected

  • Docs/examples (docs/, examples/)
  • Other: UAT CI workflows (.github/workflows/uat-*.yaml) and UAT test configs (tests/uat/**)

Implementation Notes

  • Per-intent selection. intent input (training|inference) on uat-run.yaml, forwarded to the cloud pipelines, selects tests/uat/<cloud>/tests/h100-<intent>-config.yaml and the evidence-ingest recipe name. New inference sibling configs (intent: inference / platform: dynamo). Both intents drive the same cluster-config.yaml (GPU pool from the reservation, system/CPU pools dynamic). The training TrainJob CUJ is gated to intent=training; served inference (phase_serve) is DC3's.
  • AWS capacity → post-lease assertion. Now asserts TotalInstanceCount >= desired (the reservation's fixed size) rather than AvailableInstanceCount, so contention no longer hard-fails (the lease is the gate) but a genuinely undersized reservation still does.
  • GCP capacity — decided posture. No symmetric pre-flight check; GCP relies on the GKE actuator failing at provision time. Documented in docs/contributor/uat.md and noted in uat-gcp.yaml (no phantom step).
  • Daytime lifecycle. lifecycle input (nightly|daytime-up|daytime-down). Daytime modes use a stable, reservation-tagged deployment.id so teardown and the guard find the held cluster across runs (Terraform state is remote, keyed by deployment.id). daytime-up provisions + deploys + holds; daytime-down tears the held cluster down.
  • Pre-batch guard. Every nightly run refuses to provision into a reservation that still holds an un-torn-down daytime cluster (detected by the stable cluster name), failing fast and fail-closed rather than racing.
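
As a rough illustration of the per-intent selection described above, the mapping from {cloud, intent} to a test config path could be sketched in shell as follows. The function name `resolve_test_config` is hypothetical and does not come from the workflows; only the path pattern and the two intent values are taken from the notes above.

```shell
set -eu

# Hypothetical helper (not from the PR): map {cloud, intent} to the test
# config path pattern tests/uat/<cloud>/tests/h100-<intent>-config.yaml.
resolve_test_config() {
  local cloud="$1" intent="$2"
  case "$intent" in
    training|inference) ;;  # the only two intents the input accepts
    *) echo "unsupported intent: $intent" >&2; return 1 ;;
  esac
  printf 'tests/uat/%s/tests/h100-%s-config.yaml\n' "$cloud" "$intent"
}

resolve_test_config aws inference
# prints: tests/uat/aws/tests/h100-inference-config.yaml
```

Rejecting unknown intents before building the path keeps a typo from silently selecting a config file that does not exist.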

Testing

# Config resolution (recipe + bundle) for both new inference configs
aicr recipe --config tests/uat/aws/tests/h100-inference-config.yaml
aicr bundle --config tests/uat/aws/tests/h100-inference-config.yaml   # + gcp
# Workflow + docs lint
yamllint -c .yamllint.yaml .github/workflows/uat-*.yaml
actionlint -shellcheck= .github/workflows/uat-*.yaml
./tools/check-docs-filenames && ./tools/check-docs-mdx

Both inference configs resolve (recipe + bundle produce a valid bundle/helmfile.yaml); validate --no-cluster only needs a runtime snapshot. yamllint, actionlint (pinned v1.7.11, matching the merge gate's -shellcheck=), and the docs filename/MDX checks pass. No Go changes. Real UAT (provision on live GPU hardware) is not runnable locally and is exercised by dispatching the workflow.

Risk Assessment

  • Low — CI/config/docs only, no Go changes, easy to revert; new lifecycles default to the existing nightly behavior.

Rollout notes: intent and lifecycle default to training/nightly, so the nightly cron and existing dispatches are unchanged. The daytime lifecycle's workload content and access distribution are DC8's and layer on top of the daytime-up mechanic shipped here.

Checklist

  • Tests pass locally (make test with -race) — N/A, no Go changes
  • Linter passes (make lint) — yamllint + actionlint + docs checks pass
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — new inference AICRConfigs (validated via recipe/bundle resolution + existing cuj2-inference chainsaw tests)
  • I updated docs if user-facing behavior changed — docs/contributor/uat.md
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

Implements DC2 (NVIDIA#1275): acquire a UAT cluster shaped by {reservation, intent}
through the reservation lease, and add the two cluster lifecycles a reservation
can carry (nightly per-run vs long-lived daytime) with an enforced cleanup
boundary between them.

Per-intent selection: add an intent input (training|inference) to uat-run.yaml,
forwarded to the cloud pipelines, selecting tests/uat/<cloud>/tests/
h100-<intent>-config.yaml and the evidence-ingest recipe name. Add the inference
sibling configs (criteria intent=inference/platform=dynamo). Both intents drive
the same cluster-config (GPU pool from the reservation, system/CPU pools
dynamic); the training TrainJob CUJ is gated to intent=training since served
inference (phase_serve) is DC3's.

AWS capacity: convert the race-and-fail pre-flight into a post-lease assertion
on TotalInstanceCount (the reservation's fixed size) rather than
AvailableInstanceCount, so contention no longer fails (the lease is the gate)
but a genuinely undersized reservation still does.
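
The post-lease assertion can be pictured with a minimal shell sketch. The helper name `assert_reservation_capacity` is an assumption, and so is the idea that the total is read from `aws ec2 describe-capacity-reservations`; the PR text only states that the check compares TotalInstanceCount against the desired size.

```shell
set -eu

# Hypothetical helper (not from the PR): fail only when the reservation's
# fixed size is smaller than what the run needs. Instances being busy
# (a low AvailableInstanceCount) is deliberately not checked here, because
# the reservation lease is what serializes contention.
assert_reservation_capacity() {
  local total="$1" desired="$2"
  if [ "$total" -lt "$desired" ]; then
    echo "reservation undersized: TotalInstanceCount=$total < desired=$desired" >&2
    return 1
  fi
  echo "capacity ok: TotalInstanceCount=$total >= desired=$desired"
}

# Assumption: in a real run the total might come from something like
#   aws ec2 describe-capacity-reservations \
#     --capacity-reservation-ids "$RESERVATION_ID" \
#     --query 'CapacityReservations[0].TotalInstanceCount' --output text
assert_reservation_capacity 4 4
```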

GCP capacity: record the decision to rely on the GKE actuator failing at
provision time (no symmetric pre-flight check), documented in
docs/contributor/uat.md and noted in uat-gcp.yaml.

Daytime lifecycle: add a lifecycle input (nightly|daytime-up|daytime-down).
daytime-* use a stable, reservation-tagged deployment.id so the evening teardown
and the pre-batch guard can find the held cluster across runs (state is remote,
keyed by deployment.id). daytime-up provisions, deploys, and holds; daytime-down
tears the held cluster down.
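
To make the stable-ID idea concrete, here is a small shell sketch. The naming scheme (`uat-daytime-<reservation>` and `uat-<reservation>-<run id>`) and the function name `deployment_id` are invented for illustration; the PR does not show the real deployment.id format.

```shell
set -eu

# Hypothetical naming scheme (not from the PR). Daytime modes ignore the
# run id so that daytime-up and a later daytime-down resolve to the same
# deployment.id; nightly runs get a per-run id.
deployment_id() {
  local lifecycle="$1" reservation="$2" run_id="$3"
  case "$lifecycle" in
    nightly) printf 'uat-%s-%s\n' "$reservation" "$run_id" ;;
    daytime-up|daytime-down) printf 'uat-daytime-%s\n' "$reservation" ;;
    *) echo "unsupported lifecycle: $lifecycle" >&2; return 1 ;;
  esac
}

deployment_id daytime-up aws-h100 101
deployment_id daytime-down aws-h100 202
# both print: uat-daytime-aws-h100
```

Because the remote state is keyed by deployment.id, an ID that does not depend on the run is what lets a separate teardown run find the cluster that an earlier run created.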

Pre-batch guard: every nightly run refuses to provision into a reservation that
still holds an un-torn-down daytime cluster (detected by the stable cluster
name), failing fast and fail-closed rather than racing the reservation.
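
A fail-closed guard of this kind might look like the shell sketch below. Everything named here is hypothetical: `list_clusters` stands in for whatever cloud listing call the workflow uses, and the cluster name is an example. Applying the guard only to the nightly lifecycle is an assumption based on the wording above.

```shell
set -eu

# Stand-in for the cloud's cluster listing call (hypothetical); prints one
# cluster name per line.
list_clusters() {
  printf '%s\n' "uat-daytime-aws-h100"
}

# Hypothetical guard: refuse a nightly run when the stable daytime cluster
# name is still present, and also refuse when the lookup itself fails.
pre_batch_guard() {
  local lifecycle="$1" daytime_name="$2" clusters
  [ "$lifecycle" = "nightly" ] || return 0
  if ! clusters="$(list_clusters)"; then
    echo "guard: cluster lookup failed; refusing to provision (fail-closed)" >&2
    return 1
  fi
  # -F fixed string, -x whole-line match, -q quiet
  if printf '%s\n' "$clusters" | grep -Fxq -- "$daytime_name"; then
    echo "guard: daytime cluster $daytime_name is still held; run daytime-down first" >&2
    return 1
  fi
}

pre_batch_guard nightly uat-daytime-aws-h100 || echo "nightly run blocked"
```

Treating a failed lookup the same as a found cluster is what makes the guard fail-closed: an outage in the listing call cannot be mistaken for an empty reservation.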

Signed-off-by: Nathan Hensley <nhensley@nvidia.com>
@njhensley njhensley requested review from a team as code owners July 1, 2026 23:58
@njhensley njhensley added the theme/ci-dx (CI pipelines, developer experience, and build tooling) label Jul 1, 2026
@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

@njhensley njhensley enabled auto-merge (squash) July 2, 2026 00:07
@njhensley njhensley merged commit 55fd14f into NVIDIA:main Jul 2, 2026
36 of 37 checks passed
@njhensley njhensley deleted the ci/dc2-per-intent-acquisition branch July 2, 2026 00:08
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

📝 Walkthrough

This PR introduces intent and lifecycle dials across the UAT workflow suite (uat-run.yaml, uat-aws.yaml, uat-gcp.yaml). It adds input validation, lifecycle-based provisioning/teardown gating (nightly, daytime-up, daytime-down), a pre-batch guard checking for existing daytime clusters, a reworked AWS capacity assertion using total instance count, intent-driven test config and evidence recipe selection, expanded Test Summary output, new H100 inference test config YAML files for AWS and GCP, and updated contributor documentation.

Estimated code review effort: 4 (Complex) | ~60 minutes

Possibly related PRs

  • NVIDIA/aicr#1569: Prior refactor of uat-run.yaml dispatch inputs and uat-aws.yaml/uat-gcp.yaml config selection that this PR's intent/lifecycle wiring builds upon.

Suggested labels: area/infra

Suggested reviewers: mchmarny, xdu31

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: per-intent UAT cluster acquisition with daytime lifecycles. |
| Description check | ✅ Passed | The description matches the workflow, config, and docs changes and explains the new intent and lifecycle behavior. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
.github/workflows/uat-gcp.yaml (1)

162-189: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Keep daytime-down independent of build/image setup.

Lines 165, 178, and 189 skip the build and image work, but daytime-down still runs unrelated setup and auth steps before teardown. A failure in Go setup, GHCR login, the initial GCP auth/setup-gcloud, or an unnecessary tool install can block the cleanup path.

Suggested tightening
       - name: Setup Go
-        if: inputs.aicr_version == '' && inputs.skip_tests != true
+        if: inputs.aicr_version == '' && inputs.skip_tests != true && inputs.lifecycle != 'daytime-down'

       - name: Authenticate to GHCR
+        if: inputs.lifecycle != 'daytime-down'
         uses: ./.github/actions/ghcr-login

       - name: Authenticate to GCP
         id: auth
+        if: inputs.lifecycle != 'daytime-down'
         uses: google-github-actions/auth@7c6bc770dae815cd3e89ee6cdf493a5fab2cc093  # v3.0.0

       - name: Setup gcloud
+        if: inputs.lifecycle != 'daytime-down'
         uses: google-github-actions/setup-gcloud@aa5489c8933f4cc7a4f7d45035b3b1440c9c10db  # v3.0.1

-          install_kubectl: 'true'
+          install_kubectl: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
           install_yq: 'true'
-          install_helm: 'true'
+          install_helm: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
-          install_helmfile: 'true'
+          install_helmfile: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
-          install_ko: ${{ inputs.aicr_version == '' && inputs.skip_tests != true && 'true' || 'false' }}
+          install_ko: ${{ inputs.aicr_version == '' && inputs.skip_tests != true && inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/uat-gcp.yaml around lines 162 - 189, Keep the
`daytime-down` path isolated from non-teardown setup: update the workflow around
the `Build aicr binary`, `Install released aicr`, `Authenticate to GHCR`, and
other bootstrap steps so they are also skipped when `inputs.lifecycle ==
'daytime-down'`. Make sure the teardown job only runs the cleanup/auth needed
for `daytime-down`, and avoid any Go build, release install, GHCR login, or
image-related preparation in that path.
.github/workflows/uat-aws.yaml (1)

164-191: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Keep daytime-down independent of build/image setup.

Lines 167, 180, and 191 skip the heavy build/push steps, but a daytime-down run still executes unrelated setup such as Go setup, GHCR login, initial AWS auth, and ko installation before teardown. Any failure there can prevent the config update and tool setup, turning an explicit cleanup run into a leaked cluster.

Suggested tightening
       - name: Setup Go
-        if: inputs.aicr_version == '' && inputs.skip_tests != true
+        if: inputs.aicr_version == '' && inputs.skip_tests != true && inputs.lifecycle != 'daytime-down'

       - name: Authenticate to GHCR
+        if: inputs.lifecycle != 'daytime-down'
         uses: ./.github/actions/ghcr-login

       - name: Configure AWS credentials
         id: auth
+        if: inputs.lifecycle != 'daytime-down'
         uses: aws-actions/configure-aws-credentials@254c19bd240aabef8777f48595e9d2d7b972184b  # v6.2.1

-          install_kubectl: 'true'
+          install_kubectl: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
           install_yq: 'true'
-          install_helm: 'true'
+          install_helm: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
-          install_helmfile: 'true'
+          install_helmfile: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
-          install_ko: ${{ inputs.aicr_version == '' && inputs.skip_tests != true && 'true' || 'false' }}
+          install_ko: ${{ inputs.aicr_version == '' && inputs.skip_tests != true && inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/uat-aws.yaml around lines 164 - 191, The daytime-down path
is still entering unrelated setup before teardown, which can block cleanup if
those steps fail. Update the workflow conditions around the setup sequence in
uat-aws.yaml so the daytime-down lifecycle skips nonessential prep like Go
setup, GHCR login, AWS auth, and ko installation, not just the build/push jobs.
Use the existing lifecycle checks on the same steps around Build aicr binary,
Install released aicr, Authenticate to GHCR, and the AWS/ko setup blocks to keep
daytime-down independent and teardown-only.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: fc868d63-8b16-4958-a737-41ef8f14c930

📥 Commits

Reviewing files that changed from the base of the PR and between 83d9825 and eec2a1b.

📒 Files selected for processing (6)
  • .github/workflows/uat-aws.yaml
  • .github/workflows/uat-gcp.yaml
  • .github/workflows/uat-run.yaml
  • docs/contributor/uat.md
  • tests/uat/aws/tests/h100-inference-config.yaml
  • tests/uat/gcp/tests/h100-inference-config.yaml

Comment on lines +516 to +530
SUMMARY_INTENT: ${{ inputs.intent }}
SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
run: |
{
echo "## UAT Results (AWS)"
echo ""
printf '**Reservation:** `%s`\n' "$SUMMARY_RESERVATION"
printf '**Reservation:** `%s` · **Intent:** `%s` · **Lifecycle:** `%s`\n' \
"$SUMMARY_RESERVATION" "$SUMMARY_INTENT" "$SUMMARY_LIFECYCLE"
printf '**Cluster:** `%s`\n' "$DEPLOYMENT_ID"
printf '**AICR version:** `%s`\n' "${SUMMARY_AICR_VERSION:-main (build from source)}"
echo "**Build:** \`${{ github.sha }}\` (branch: \`${{ github.ref_name }}\`)"
echo ""
echo "| Phase | Status |"
echo "|-------|--------|"
echo "| Pre-batch guard | ${{ steps.guard.outcome }} |"

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Avoid shell-expanding github.ref_name in the summary step.

Line 526 injects github.ref_name directly into a double-quoted shell command. Branch/ref names can contain shell metacharacters, so pass it through env and print it as data instead.
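
The difference can be seen in a small standalone shell sketch. The values are placeholders, not taken from the workflow: `abc1234` is a made-up SHA, and `feature/$(id)` is an example of a ref name that git accepts but that contains a command substitution. When `${{ github.ref_name }}` is templated straight into a double-quoted `echo`, the shell receives that text as code; when it arrives through an environment variable and is passed to `printf` as an argument, it is only data.

```shell
set -eu

# Placeholder values; the ref name carries shell metacharacters that are
# legal in a git branch name.
SUMMARY_SHA='abc1234'
SUMMARY_REF_NAME='feature/$(id)'

# The values reach printf as arguments, so they are printed verbatim and
# never evaluated by the shell.
build_line() {
  printf '**Build:** `%s` (branch: `%s`)\n' "$SUMMARY_SHA" "$SUMMARY_REF_NAME"
}

build_line
# prints: **Build:** `abc1234` (branch: `feature/$(id)`)
```

The single-quoted format string also keeps the Markdown backticks literal, so no escaping is needed.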

Suggested fix
         env:
           SUMMARY_RESERVATION: ${{ inputs.reservation }}
           SUMMARY_AICR_VERSION: ${{ inputs.aicr_version }}
           SUMMARY_INTENT: ${{ inputs.intent }}
           SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
+          SUMMARY_SHA: ${{ github.sha }}
+          SUMMARY_REF_NAME: ${{ github.ref_name }}
         run: |
           {
             echo "## UAT Results (AWS)"
@@
-            echo "**Build:** \`${{ github.sha }}\` (branch: \`${{ github.ref_name }}\`)"
+            printf '**Build:** `%s` (branch: `%s`)\n' "$SUMMARY_SHA" "$SUMMARY_REF_NAME"
📝 Committable suggestion

Suggested change
SUMMARY_INTENT: ${{ inputs.intent }}
SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
SUMMARY_SHA: ${{ github.sha }}
SUMMARY_REF_NAME: ${{ github.ref_name }}
run: |
{
echo "## UAT Results (AWS)"
echo ""
printf '**Reservation:** `%s` · **Intent:** `%s` · **Lifecycle:** `%s`\n' \
"$SUMMARY_RESERVATION" "$SUMMARY_INTENT" "$SUMMARY_LIFECYCLE"
printf '**Cluster:** `%s`\n' "$DEPLOYMENT_ID"
printf '**AICR version:** `%s`\n' "${SUMMARY_AICR_VERSION:-main (build from source)}"
printf '**Build:** `%s` (branch: `%s`)\n' "$SUMMARY_SHA" "$SUMMARY_REF_NAME"
echo ""
echo "| Phase | Status |"
echo "|-------|--------|"
echo "| Pre-batch guard | ${{ steps.guard.outcome }} |"
🧰 Tools
🪛 zizmor (1.26.1)

[warning] 526-526: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)


[error] 526-526: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)


[info] 530-530: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/uat-aws.yaml around lines 516 - 530, The summary step in
the workflow is embedding github.ref_name directly inside a double-quoted shell
echo/printf command, which can allow shell metacharacters in the ref name to be
interpreted. Move the ref name into an env variable for this step and print that
variable as data in the summary block, using the existing run section that
writes the UAT Results and references github.sha/github.ref_name.

Source: Linters/SAST tools

Comment on lines +477 to +491
SUMMARY_INTENT: ${{ inputs.intent }}
SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
run: |
{
echo "## UAT Results (GCP)"
echo ""
printf '**Reservation:** `%s`\n' "$SUMMARY_RESERVATION"
printf '**Reservation:** `%s` · **Intent:** `%s` · **Lifecycle:** `%s`\n' \
"$SUMMARY_RESERVATION" "$SUMMARY_INTENT" "$SUMMARY_LIFECYCLE"
printf '**Cluster:** `%s`\n' "$DEPLOYMENT_ID"
printf '**AICR version:** `%s`\n' "${SUMMARY_AICR_VERSION:-main (build from source)}"
echo "**Build:** \`${{ github.sha }}\` (branch: \`${{ github.ref_name }}\`)"
echo ""
echo "| Phase | Status |"
echo "|-------|--------|"
echo "| Pre-batch guard | ${{ steps.guard.outcome }} |"

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Avoid shell-expanding github.ref_name in the summary step.

Line 487 injects github.ref_name directly into a double-quoted shell command. Pass it via env and print with printf so the ref name is treated as data.

Suggested fix
         env:
           SUMMARY_RESERVATION: ${{ inputs.reservation }}
           SUMMARY_AICR_VERSION: ${{ inputs.aicr_version }}
           SUMMARY_INTENT: ${{ inputs.intent }}
           SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
+          SUMMARY_SHA: ${{ github.sha }}
+          SUMMARY_REF_NAME: ${{ github.ref_name }}
         run: |
           {
             echo "## UAT Results (GCP)"
@@
-            echo "**Build:** \`${{ github.sha }}\` (branch: \`${{ github.ref_name }}\`)"
+            printf '**Build:** `%s` (branch: `%s`)\n' "$SUMMARY_SHA" "$SUMMARY_REF_NAME"
📝 Committable suggestion

Suggested change
SUMMARY_INTENT: ${{ inputs.intent }}
SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
SUMMARY_SHA: ${{ github.sha }}
SUMMARY_REF_NAME: ${{ github.ref_name }}
run: |
{
echo "## UAT Results (GCP)"
echo ""
printf '**Reservation:** `%s` · **Intent:** `%s` · **Lifecycle:** `%s`\n' \
"$SUMMARY_RESERVATION" "$SUMMARY_INTENT" "$SUMMARY_LIFECYCLE"
printf '**Cluster:** `%s`\n' "$DEPLOYMENT_ID"
printf '**AICR version:** `%s`\n' "${SUMMARY_AICR_VERSION:-main (build from source)}"
printf '**Build:** `%s` (branch: `%s`)\n' "$SUMMARY_SHA" "$SUMMARY_REF_NAME"
echo ""
echo "| Phase | Status |"
echo "|-------|--------|"
echo "| Pre-batch guard | ${{ steps.guard.outcome }} |"
🧰 Tools
🪛 zizmor (1.26.1)

[warning] 487-487: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)


[error] 487-487: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)


[info] 491-491: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/uat-gcp.yaml around lines 477 - 491, The UAT summary step
is interpolating github.ref_name directly inside the shell block, which should
be treated as data instead of inline shell text. Update the summary step in the
workflow job that prints “UAT Results (GCP)” to pass github.ref_name through env
alongside the other SUMMARY_* values, then use printf (or equivalent) to render
the branch name in the build line so the ref name is not shell-expanded.

Source: Linters/SAST tools

njhensley added a commit to njhensley/aicr that referenced this pull request Jul 2, 2026
Implements DC8 (NVIDIA#1281): stand up one long-lived, human-facing deployment
per cloud for the working day, then tear it down before the nightly batch.
DC2 (NVIDIA#1586) already owns the daytime-up/daytime-down lifecycle, provision-
and-hold, teardown, and pre-batch guard; DC8 adds the orchestration on top —
the cloud→flavor data mapping, the scheduler, and the out-of-band access path.

Cloud→flavor split (data, not code): add an optional daytime-intent column
to the reservation registry (aws-h100=training, gcp-h100=inference at launch;
empty = nightly-batch only). pkg/uatbroker gains the DaytimeIntent field,
intent constants + validation, and DaytimeAssignments(); uat-broker gains a
`reservations --daytime` JSON matrix output. Only one daytime reservation per
cloud is allowed (a reservation cannot hold both a held daytime cluster and
the nightly batch at once) — enforced by the committed-registry test.

Scheduler (uat-daytime.yaml): a thin scheduler over the daytime-up/daytime-down
mechanics. A morning cron dispatches daytime-up per rotation reservation, an
evening cron (before the 04:00 batch) dispatches daytime-down; a manual
workflow_dispatch(action=up|down) covers ad-hoc runs. Each dispatch routes
through uat-run.yaml so it takes the same per-reservation lease as the batch,
and is watched to completion so a failed handoff/teardown surfaces. The daytime
cluster is not a UAT cell — daytime-up stops after deploy, emitting no evidence
bundle and no TestGrid column.

Docs (docs/contributor/uat.md): document the cloud→flavor split, the scheduler
and its cron edges, missed-teardown recovery via DC2's pre-batch guard, and the
out-of-band access path — stable cluster names gated by cloud IAM so no
credential transits CI, with TrainJob submission (AWS) and the Dynamo OpenAI
endpoint port-forward (GCP). The served DynamoGraphDeployment remains DC3's
phase_serve; until it lands the served workload is a documented manual apply.

Signed-off-by: Nathan Hensley <nhensley@nvidia.com>

Labels

area/ci, area/docs, area/tests, size/XL, theme/ci-dx (CI pipelines, developer experience, and build tooling)

Development

Successfully merging this pull request may close these issues.

DC2 — Dynamic per-intent cluster acquisition

2 participants