feat(ci): per-intent + daytime-lifecycle UAT cluster acquisition by njhensley · Pull Request #1586 · NVIDIA/aicr

njhensley · 2026-07-01T23:58:55Z

Summary

Acquire a UAT cluster shaped by {reservation, intent} through the reservation lease, and add the two cluster lifecycles a reservation can carry — the nightly per-run cluster and the long-lived daytime deployment — with an enforced cleanup boundary between them.

Motivation / Context

Implements DC2. Previously, AWS hard-failed on a busy reservation in a race-and-fail pre-flight check, the test config and recipe intent were hard-coded to training, and there was neither a mechanism for the morning-handoff daytime cluster nor a guard to stop the nightly batch from racing an un-torn-down daytime deployment. This builds on DC1's reservation lease and dispatch surface.

Fixes: #1275
Related: #1264, #1274

Type of Change

New feature (non-breaking change that adds functionality)
Documentation update
Build/CI/tooling

Component(s) Affected

Docs/examples (docs/, examples/)
Other: UAT CI workflows (.github/workflows/uat-*.yaml) and UAT test configs (tests/uat/**)

Implementation Notes

Per-intent selection. intent input (training|inference) on uat-run.yaml, forwarded to the cloud pipelines, selects tests/uat/<cloud>/tests/h100-<intent>-config.yaml and the evidence-ingest recipe name. New inference sibling configs (intent: inference / platform: dynamo). Both intents drive the same cluster-config.yaml (GPU pool from the reservation, system/CPU pools dynamic). The training TrainJob CUJ is gated to intent=training; served inference (phase_serve) is DC3's.
AWS capacity → post-lease assertion. Now asserts TotalInstanceCount >= desired (the reservation's fixed size) rather than AvailableInstanceCount, so contention no longer hard-fails (the lease is the gate) but a genuinely undersized reservation still does.
GCP capacity — decided posture. No symmetric pre-flight check; GCP relies on the GKE actuator failing at provision time. Documented in docs/contributor/uat.md and noted in uat-gcp.yaml (no phantom step).
Daytime lifecycle. lifecycle input (nightly|daytime-up|daytime-down). Daytime modes use a stable, reservation-tagged deployment.id so teardown and the guard find the held cluster across runs (Terraform state is remote, keyed by deployment.id). daytime-up provisions + deploys + holds; daytime-down tears the held cluster down.
Pre-batch guard. Every nightly run refuses to provision into a reservation that still holds an un-torn-down daytime cluster (detected by the stable cluster name), failing fast and fail-closed rather than racing.
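The per-intent selection and the post-lease capacity assertion described above can be sketched in a few lines of shell. This is an illustration under assumptions, not the workflow's actual code: the helper names `resolve_uat_config` and `assert_reservation_capacity` are hypothetical, and restricting `cloud` to `aws|gcp` is inferred from the two pipelines named in this PR. Only the path pattern and the `TotalInstanceCount >= desired` comparison come from the notes.

```shell
# Hypothetical helper: map {cloud, intent} to the test config path pattern
# tests/uat/<cloud>/tests/h100-<intent>-config.yaml quoted in the notes.
resolve_uat_config() {
  cloud="$1"
  intent="$2"
  case "$cloud" in
    aws|gcp) ;;  # assumption: only the two cloud pipelines in this PR
    *) echo "unsupported cloud: $cloud" >&2; return 1 ;;
  esac
  case "$intent" in
    training|inference) ;;
    *) echo "invalid intent: $intent (want training|inference)" >&2; return 1 ;;
  esac
  printf 'tests/uat/%s/tests/h100-%s-config.yaml\n' "$cloud" "$intent"
}

# Hypothetical helper: post-lease assertion on the reservation's fixed size.
# $1 is the reservation's TotalInstanceCount, $2 the desired node count.
# Contention is not checked here because the lease is the gate.
assert_reservation_capacity() {
  total="$1"
  desired="$2"
  if [ "$total" -lt "$desired" ]; then
    echo "reservation undersized: total=$total desired=$desired" >&2
    return 1
  fi
}

resolve_uat_config aws inference
assert_reservation_capacity 8 8 && echo "capacity ok"
```

Comparing against the total rather than the available count means a reservation that is merely busy passes the check, while one that is too small for the requested pool still fails.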

Testing

# Config resolution (recipe + bundle) for both new inference configs
aicr recipe --config tests/uat/aws/tests/h100-inference-config.yaml
aicr bundle --config tests/uat/aws/tests/h100-inference-config.yaml   # + gcp
# Workflow + docs lint
yamllint -c .yamllint.yaml .github/workflows/uat-*.yaml
actionlint -shellcheck= .github/workflows/uat-*.yaml
./tools/check-docs-filenames && ./tools/check-docs-mdx

Both inference configs resolve (recipe + bundle produce a valid bundle/helmfile.yaml); validate --no-cluster only needs a runtime snapshot. yamllint, actionlint (pinned v1.7.11, matching the merge gate's -shellcheck=), and the docs filename/MDX checks pass. No Go changes. Real UAT (provision on live GPU hardware) is not runnable locally and is exercised by dispatching the workflow.

Risk Assessment

Low — CI/config/docs only, no Go changes, easy to revert; new lifecycles default to the existing nightly behavior.

Rollout notes: intent and lifecycle default to training/nightly, so the nightly cron and existing dispatches are unchanged. The daytime lifecycle's workload content and access distribution are DC8's and layer on top of the daytime-up mechanic shipped here.
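As a rough sketch of how the defaults and the three lifecycle modes relate, the following shell maps a lifecycle value to the phases it would run. The function name and the phase labels are assumptions for illustration, not the workflows' step ids, and the ordering of the nightly phases is inferred from the description (guard before provision, teardown at the end of the run).

```shell
# Hypothetical helper: list the phases each lifecycle runs.
# An empty or missing argument falls back to nightly, matching the rollout note.
lifecycle_phases() {
  lifecycle="${1:-nightly}"
  case "$lifecycle" in
    nightly)      echo "guard provision deploy test teardown" ;;
    daytime-up)   echo "provision deploy hold" ;;
    daytime-down) echo "teardown" ;;
    *)
      echo "invalid lifecycle: $lifecycle (want nightly|daytime-up|daytime-down)" >&2
      return 1
      ;;
  esac
}

# Defaults keep existing dispatches unchanged: training intent, nightly lifecycle.
INTENT="${INTENT:-training}"
echo "intent=$INTENT phases=$(lifecycle_phases "${LIFECYCLE:-}")"
```

Because both inputs default to the pre-existing behavior, a caller that sets neither input gets the same nightly training run as before.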

Checklist

Tests pass locally (make test with -race) — N/A, no Go changes
Linter passes (make lint) — yamllint + actionlint + docs checks pass
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — new inference AICRConfigs (validated via recipe/bundle resolution + existing cuj2-inference chainsaw tests)
I updated docs if user-facing behavior changed — docs/contributor/uat.md
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

Implements DC2 (NVIDIA#1275): acquire a UAT cluster shaped by {reservation, intent} through the reservation lease, and add the two cluster lifecycles a reservation can carry (nightly per-run vs long-lived daytime) with an enforced cleanup boundary between them.

Per-intent selection: add an intent input (training|inference) to uat-run.yaml, forwarded to the cloud pipelines, selecting tests/uat/<cloud>/tests/h100-<intent>-config.yaml and the evidence-ingest recipe name. Add the inference sibling configs (criteria intent=inference/platform=dynamo). Both intents drive the same cluster-config (GPU pool from the reservation, system/CPU pools dynamic); the training TrainJob CUJ is gated to intent=training since served inference (phase_serve) is DC3's.

AWS capacity: convert the race-and-fail pre-flight into a post-lease assertion on TotalInstanceCount (the reservation's fixed size) rather than AvailableInstanceCount, so contention no longer fails (the lease is the gate) but a genuinely undersized reservation still does.

GCP capacity: record the decision to rely on the GKE actuator failing at provision time (no symmetric pre-flight check), documented in docs/contributor/uat.md and noted in uat-gcp.yaml.

Daytime lifecycle: add a lifecycle input (nightly|daytime-up|daytime-down). daytime-* use a stable, reservation-tagged deployment.id so the evening teardown and the pre-batch guard can find the held cluster across runs (state is remote, keyed by deployment.id). daytime-up provisions, deploys, and holds; daytime-down tears the held cluster down.

Pre-batch guard: every nightly run refuses to provision into a reservation that still holds an un-torn-down daytime cluster (detected by the stable cluster name), failing fast and fail-closed rather than racing the reservation.

Signed-off-by: Nathan Hensley <nhensley@nvidia.com>

github-actions · 2026-07-01T23:59:56Z

🌿 Preview your docs: https://nvidia-preview-ci-dc2-per-intent-acquisition.docs.buildwithfern.com/aicr

coderabbitai · 2026-07-02T00:08:48Z

📝 Walkthrough


This PR introduces intent and lifecycle dials across the UAT workflow suite (uat-run.yaml, uat-aws.yaml, uat-gcp.yaml). It adds input validation, lifecycle-based provisioning/teardown gating (nightly, daytime-up, daytime-down), a pre-batch guard checking for existing daytime clusters, a reworked AWS capacity assertion using total instance count, intent-driven test config and evidence recipe selection, expanded Test Summary output, new H100 inference test config YAML files for AWS and GCP, and updated contributor documentation.

Estimated code review effort: 4 (Complex) | ~60 minutes

Possibly related PRs

NVIDIA/aicr#1569: Prior refactor of uat-run.yaml dispatch inputs and uat-aws.yaml/uat-gcp.yaml config selection that this PR's intent/lifecycle wiring builds upon.

Suggested labels: area/infra

Suggested reviewers: mchmarny, xdu31

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: per-intent UAT cluster acquisition with daytime lifecycles.
Description check	✅ Passed	The description matches the workflow, config, and docs changes and explains the new intent and lifecycle behavior.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

.github/workflows/uat-gcp.yaml (1)

162-189: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Keep daytime-down independent of build/image setup.

Lines 165, 178, and 189 skip build and image work, but daytime-down still runs unrelated setup/auth steps before teardown. A failure in Go setup, GHCR login, initial GCP auth/setup-gcloud, or unnecessary tool installs can block the cleanup path.

Suggested tightening

       - name: Setup Go
-        if: inputs.aicr_version == '' && inputs.skip_tests != true
+        if: inputs.aicr_version == '' && inputs.skip_tests != true && inputs.lifecycle != 'daytime-down'

       - name: Authenticate to GHCR
+        if: inputs.lifecycle != 'daytime-down'
         uses: ./.github/actions/ghcr-login

       - name: Authenticate to GCP
         id: auth
+        if: inputs.lifecycle != 'daytime-down'
         uses: google-github-actions/auth@7c6bc770dae815cd3e89ee6cdf493a5fab2cc093  # v3.0.0

       - name: Setup gcloud
+        if: inputs.lifecycle != 'daytime-down'
         uses: google-github-actions/setup-gcloud@aa5489c8933f4cc7a4f7d45035b3b1440c9c10db  # v3.0.1

-          install_kubectl: 'true'
+          install_kubectl: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
           install_yq: 'true'
-          install_helm: 'true'
+          install_helm: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
-          install_helmfile: 'true'
+          install_helmfile: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
-          install_ko: ${{ inputs.aicr_version == '' && inputs.skip_tests != true && 'true' || 'false' }}
+          install_ko: ${{ inputs.aicr_version == '' && inputs.skip_tests != true && inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/uat-gcp.yaml around lines 162 - 189, Keep the
`daytime-down` path isolated from non-teardown setup: update the workflow around
the `Build aicr binary`, `Install released aicr`, `Authenticate to GHCR`, and
other bootstrap steps so they are also skipped when `inputs.lifecycle ==
'daytime-down'`. Make sure the teardown job only runs the cleanup/auth needed
for `daytime-down`, and avoid any Go build, release install, GHCR login, or
image-related preparation in that path.

.github/workflows/uat-aws.yaml (1)

164-191: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Keep daytime-down independent of build/image setup.

Lines 167, 180, and 191 skip the heavy build/push steps, but a daytime-down run still executes unrelated setup such as Go setup, GHCR login, initial AWS auth, and ko installation before teardown. Any failure there can prevent config update/tool setup and turn an explicit cleanup run into a leaked cluster.

Suggested tightening

       - name: Setup Go
-        if: inputs.aicr_version == '' && inputs.skip_tests != true
+        if: inputs.aicr_version == '' && inputs.skip_tests != true && inputs.lifecycle != 'daytime-down'

       - name: Authenticate to GHCR
+        if: inputs.lifecycle != 'daytime-down'
         uses: ./.github/actions/ghcr-login

       - name: Configure AWS credentials
         id: auth
+        if: inputs.lifecycle != 'daytime-down'
         uses: aws-actions/configure-aws-credentials@254c19bd240aabef8777f48595e9d2d7b972184b  # v6.2.1

-          install_kubectl: 'true'
+          install_kubectl: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
           install_yq: 'true'
-          install_helm: 'true'
+          install_helm: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
-          install_helmfile: 'true'
+          install_helmfile: ${{ inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}
-          install_ko: ${{ inputs.aicr_version == '' && inputs.skip_tests != true && 'true' || 'false' }}
+          install_ko: ${{ inputs.aicr_version == '' && inputs.skip_tests != true && inputs.lifecycle != 'daytime-down' && 'true' || 'false' }}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/uat-aws.yaml around lines 164 - 191, The daytime-down path
is still entering unrelated setup before teardown, which can block cleanup if
those steps fail. Update the workflow conditions around the setup sequence in
uat-aws.yaml so the daytime-down lifecycle skips nonessential prep like Go
setup, GHCR login, AWS auth, and ko installation, not just the build/push jobs.
Use the existing lifecycle checks on the same steps around Build aicr binary,
Install released aicr, Authenticate to GHCR, and the AWS/ko setup blocks to keep
daytime-down independent and teardown-only.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: fc868d63-8b16-4958-a737-41ef8f14c930

📥 Commits

Reviewing files that changed from the base of the PR and between 83d9825 and eec2a1b.

📒 Files selected for processing (6)

.github/workflows/uat-aws.yaml
.github/workflows/uat-gcp.yaml
.github/workflows/uat-run.yaml
docs/contributor/uat.md
tests/uat/aws/tests/h100-inference-config.yaml
tests/uat/gcp/tests/h100-inference-config.yaml

coderabbitai · 2026-07-02T00:08:51Z

+          SUMMARY_INTENT: ${{ inputs.intent }}
+          SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
        run: |
          {
            echo "## UAT Results (AWS)"
            echo ""
-            printf '**Reservation:** `%s`\n' "$SUMMARY_RESERVATION"
+            printf '**Reservation:** `%s` · **Intent:** `%s` · **Lifecycle:** `%s`\n' \
+              "$SUMMARY_RESERVATION" "$SUMMARY_INTENT" "$SUMMARY_LIFECYCLE"
+            printf '**Cluster:** `%s`\n' "$DEPLOYMENT_ID"
            printf '**AICR version:** `%s`\n' "${SUMMARY_AICR_VERSION:-main (build from source)}"
            echo "**Build:** \`${{ github.sha }}\` (branch: \`${{ github.ref_name }}\`)"
            echo ""
            echo "| Phase | Status |"
            echo "|-------|--------|"
+            echo "| Pre-batch guard | ${{ steps.guard.outcome }} |"


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Avoid shell-expanding github.ref_name in the summary step.

Line 526 injects github.ref_name directly into a double-quoted shell command. Branch/ref names can contain shell metacharacters, so pass it through env and print it as data instead.
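The difference between the two patterns can be reproduced locally with a small sketch. It assumes a hypothetical branch name containing a command substitution and simulates the template expansion by pasting the value into the script text before a shell parses it. The variable name `SUMMARY_REF_NAME` follows the suggested fix, and everything else is illustrative.

```shell
# Hypothetical ref name carrying a command substitution.
ref_name='feature/$(echo INJECTED)'

# Unsafe pattern: the value becomes part of the script text, as an inline
# ${{ github.ref_name }} would, so the shell evaluates the substitution.
unsafe_script="echo \"branch: ${ref_name}\""
unsafe_out="$(sh -c "$unsafe_script")"

# Safe pattern: the value travels through the environment and is only ever
# passed to printf as data, so it is never parsed as shell code.
safe_out="$(SUMMARY_REF_NAME="$ref_name" sh -c 'printf "branch: %s\n" "$SUMMARY_REF_NAME"')"

echo "unsafe: $unsafe_out"
echo "safe:   $safe_out"
```

In the unsafe case the embedded command runs and its output replaces the substitution. In the safe case the ref name is printed literally.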

Suggested fix

        env:
          SUMMARY_RESERVATION: ${{ inputs.reservation }}
          SUMMARY_AICR_VERSION: ${{ inputs.aicr_version }}
          SUMMARY_INTENT: ${{ inputs.intent }}
          SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
+          SUMMARY_SHA: ${{ github.sha }}
+          SUMMARY_REF_NAME: ${{ github.ref_name }}
        run: |
          {
            echo "## UAT Results (AWS)"
@@
-            echo "**Build:** \`${{ github.sha }}\` (branch: \`${{ github.ref_name }}\`)"
+            printf '**Build:** `%s` (branch: `%s`)\n' "$SUMMARY_SHA" "$SUMMARY_REF_NAME"


🧰 Tools

🪛 zizmor (1.26.1)

[warning] 526-526: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

[error] 526-526: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

[info] 530-530: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/uat-aws.yaml around lines 516 - 530, The summary step in the workflow is embedding github.ref_name directly inside a double-quoted shell echo/printf command, which can allow shell metacharacters in the ref name to be interpreted. Move the ref name into an env variable for this step and print that variable as data in the summary block, using the existing run section that writes the UAT Results and references github.sha/github.ref_name.

Source: Linters/SAST tools

coderabbitai · 2026-07-02T00:08:51Z

+          SUMMARY_INTENT: ${{ inputs.intent }}
+          SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
        run: |
          {
            echo "## UAT Results (GCP)"
            echo ""
-            printf '**Reservation:** `%s`\n' "$SUMMARY_RESERVATION"
+            printf '**Reservation:** `%s` · **Intent:** `%s` · **Lifecycle:** `%s`\n' \
+              "$SUMMARY_RESERVATION" "$SUMMARY_INTENT" "$SUMMARY_LIFECYCLE"
+            printf '**Cluster:** `%s`\n' "$DEPLOYMENT_ID"
            printf '**AICR version:** `%s`\n' "${SUMMARY_AICR_VERSION:-main (build from source)}"
            echo "**Build:** \`${{ github.sha }}\` (branch: \`${{ github.ref_name }}\`)"
            echo ""
            echo "| Phase | Status |"
            echo "|-------|--------|"
+            echo "| Pre-batch guard | ${{ steps.guard.outcome }} |"


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Avoid shell-expanding github.ref_name in the summary step.

Line 487 injects github.ref_name directly into a double-quoted shell command. Pass it via env and print with printf so the ref name is treated as data.

Suggested fix

        env:
          SUMMARY_RESERVATION: ${{ inputs.reservation }}
          SUMMARY_AICR_VERSION: ${{ inputs.aicr_version }}
          SUMMARY_INTENT: ${{ inputs.intent }}
          SUMMARY_LIFECYCLE: ${{ inputs.lifecycle }}
+          SUMMARY_SHA: ${{ github.sha }}
+          SUMMARY_REF_NAME: ${{ github.ref_name }}
        run: |
          {
            echo "## UAT Results (GCP)"
@@
-            echo "**Build:** \`${{ github.sha }}\` (branch: \`${{ github.ref_name }}\`)"
+            printf '**Build:** `%s` (branch: `%s`)\n' "$SUMMARY_SHA" "$SUMMARY_REF_NAME"


🧰 Tools

🪛 zizmor (1.26.1)

[warning] 487-487: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

[error] 487-487: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

[info] 491-491: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/uat-gcp.yaml around lines 477 - 491, The UAT summary step is interpolating github.ref_name directly inside the shell block, which should be treated as data instead of inline shell text. Update the summary step in the workflow job that prints “UAT Results (GCP)” to pass github.ref_name through env alongside the other SUMMARY_* values, then use printf (or equivalent) to render the branch name in the build line so the ref name is not shell-expanded.

Source: Linters/SAST tools

Implements DC8 (NVIDIA#1281): stand up one long-lived, human-facing deployment per cloud for the working day, then tear it down before the nightly batch. DC2 (NVIDIA#1586) already owns the daytime-up/daytime-down lifecycle, provision-and-hold, teardown, and pre-batch guard; DC8 adds the orchestration on top: the cloud→flavor data mapping, the scheduler, and the out-of-band access path.

Cloud→flavor split (data, not code): add an optional daytime-intent column to the reservation registry (aws-h100=training, gcp-h100=inference at launch; empty = nightly-batch only). pkg/uatbroker gains the DaytimeIntent field, intent constants + validation, and DaytimeAssignments(); uat-broker gains a `reservations --daytime` JSON matrix output. Only one daytime reservation per cloud is allowed (a reservation cannot hold both a held daytime cluster and the nightly batch at once), enforced by the committed-registry test.

Scheduler (uat-daytime.yaml): a thin scheduler over the daytime-up/daytime-down mechanics. A morning cron dispatches daytime-up per rotation reservation, an evening cron (before the 04:00 batch) dispatches daytime-down; a manual workflow_dispatch(action=up|down) covers ad-hoc runs. Each dispatch routes through uat-run.yaml so it takes the same per-reservation lease as the batch, and is watched to completion so a failed handoff/teardown surfaces. The daytime cluster is not a UAT cell: daytime-up stops after deploy, emitting no evidence bundle and no TestGrid column.

Docs (docs/contributor/uat.md): document the cloud→flavor split, the scheduler and its cron edges, missed-teardown recovery via DC2's pre-batch guard, and the out-of-band access path: stable cluster names gated by cloud IAM so no credential transits CI, with TrainJob submission (AWS) and the Dynamo OpenAI endpoint port-forward (GCP). The served DynamoGraphDeployment remains DC3's phase_serve; until it lands the served workload is a documented manual apply.

Signed-off-by: Nathan Hensley <nhensley@nvidia.com>

njhensley requested review from a team as code owners July 1, 2026 23:58

github-actions Bot added area/ci area/tests area/docs labels Jul 1, 2026

njhensley added the theme/ci-dx CI pipelines, developer experience, and build tooling label Jul 1, 2026

github-actions Bot added the size/XL label Jul 1, 2026

mchmarny approved these changes Jul 2, 2026


njhensley enabled auto-merge (squash) July 2, 2026 00:07

njhensley merged commit 55fd14f into NVIDIA:main Jul 2, 2026
36 of 37 checks passed

njhensley deleted the ci/dc2-per-intent-acquisition branch July 2, 2026 00:08

coderabbitai Bot reviewed Jul 2, 2026


njhensley mentioned this pull request Jul 2, 2026

feat(ci): daytime human-access deployment scheduler (#1281) #1587

Merged

13 tasks
