
refactor(ci): unify GPU training and inference workflows#579

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:codex/gpu-ci-symmetry
Apr 15, 2026

Conversation

@yuanchen8911
Contributor

@yuanchen8911 yuanchen8911 commented Apr 15, 2026

Summary

Align the H100 training and inference GPU workflows around leaf-only conformance validation. Add Kubeflow training coverage for kind/H100, and remove the orphaned non-Kubeflow kind training Chainsaw directory in favor of the Kubeflow-specific suite. Harden the Karpenter KWOK build/evidence-collection path, make GoReleaser version selection explicit across the remaining composite actions and workflows so callers cannot drift from .settings.yaml, and fail fast on repeated kai-scheduler hook timeouts with targeted diagnostics.

Motivation / Context

This is the follow-up to the GPU conformance dedup work in #577.
It removes the remaining drift between training and inference GPU CI so both workflows follow the same core model: trigger on the deployed leaf recipe's real inputs, install the leaf recipe, run leaf health/resource checks, run aicr validate --phase conformance, and collect a single evidence set. Platform-specific coverage still differs where intended (robust-controller / Kubeflow on training, gateway + Dynamo smoke test on inference).

Additionally, the GPU training test run on this PR revealed that the ko build of the Karpenter KWOK provider can hang for ~47 minutes on a cold Go build cache, consuming the entire job timeout. The conformance evidence step then ran against a dead Kind cluster because its guard (always() && steps.bundle-install.outcome == 'success') did not check whether the validation step had actually executed. This PR adds a hard timeout on ko build, caches both Go module and build artifacts for Karpenter, and tightens the evidence collection guard.
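
The hard-timeout behavior can be sketched with GNU `timeout`. This is a minimal demo, not the actual `install-karpenter-kwok.sh`: a slow `sleep` stands in for a hung `ko build`, and the demo budget is in seconds where the workflow uses 15 minutes.

```shell
#!/usr/bin/env bash
# Hedged sketch of the hard-timeout wrapper; the real script wraps `ko build`.
# `sleep 10` is a stand-in for a build that hangs past the budget.
KO_BUILD_TIMEOUT="${KO_BUILD_TIMEOUT:-2}"   # demo: 2 seconds; the workflow uses 15m
if timeout "${KO_BUILD_TIMEOUT}" sleep 10; then
  build_status="finished"
else
  rc=$?
  # GNU timeout exits 124 when the deadline expires
  build_status="timed out (exit ${rc})"
fi
echo "build ${build_status}"
```

On macOS the same behavior needs `gtimeout` from coreutils, as noted in the rollout notes.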

A later GPU training run on this branch also showed that kai-scheduler can spend excessive wall-clock time in Helm retries on cold runners. The per-attempt 20m timeout is acceptable, but the global retry budget amplifies a single slow hook into a large CI time sink. This PR now caps kai-scheduler to a lower retry budget and emits focused hook diagnostics before retry/fail.
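
A bounded retry with focused diagnostics can be sketched as a shell function. The names and the `helm_retry "<desc>" "<namespace>" "<max_retries>" <command> [args...]` contract mirror what was discussed in review, but this is an illustrative sketch, not the generated deploy script; the diagnostics function body is a hypothetical stand-in.

```shell
#!/usr/bin/env bash
# Hedged sketch of a bounded Helm retry. The diagnostics function is a
# placeholder for the kai-scheduler job/pod/event dump described in the PR.
dump_kai_scheduler_helm_diagnostics() {
  echo "diagnostics for namespace $1 would be dumped here"
}

# Contract: helm_retry "<desc>" "<namespace>" "<max_retries>" <command> [args...]
helm_retry() {
  local desc="$1" namespace="$2" max_retries="$3"
  shift 3
  local attempt=1
  until "$@"; do
    dump_kai_scheduler_helm_diagnostics "${namespace}"
    if [ "${attempt}" -ge "${max_retries}" ]; then
      echo "${desc}: giving up after ${attempt} attempt(s)" >&2
      return 1
    fi
    attempt=$((attempt + 1))
  done
  echo "${desc}: succeeded on attempt ${attempt}"
}

# Demo: a command that always fails, with a retry budget of 2.
if helm_retry "kai-scheduler install" "kai-scheduler" 2 false; then
  result="unexpected success"
else
  result="failed fast within retry budget"
fi
echo "${result}"
```

With a budget of 2, a persistently failing hook surfaces diagnostics twice and fails fast instead of consuming the job's wall clock.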

This branch has been rebased on merged PR #580, so the Linux E2E tools/setup-tools fix is already in the base and is no longer part of this PR.

Fixes: N/A
Related: #554, #577, #580

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: GPU CI workflows, chainsaw coverage, Karpenter KWOK build, and build/release tooling wiring

Implementation Notes

  • Adds recipes/overlays/h100-kind-training-kubeflow.yaml and updates the training workflow to install platform: kubeflow.
  • Replaces the orphaned tests/chainsaw/ai-conformance/kind-training/ directory, renaming it to kind-training-kubeflow and updating the training workflow and resource assertions to consume only the Kubeflow-specific suite.
  • Aligns the H100 training and inference workflows on the same leaf-only validation model. Training now validates h100-kind-training-kubeflow; inference validates h100-kind-inference-dynamo. The inference Chainsaw leaf suite is renamed from kind/ to kind-inference-dynamo/, reusable Kind-only assertions now live under kind-common/, and both GPU workflows path-gate on a dedicated leaf suite directory plus the shared Kind assertion directory instead of broad catch-all globs.
  • Extends training conformance coverage with robust-controller and secure-accelerator-access, and adds Kubeflow chainsaw/assert coverage for the trainer controller, webhook, and TrainJob CRD.
  • Moves inference to H100 x2, adds gang-scheduling and cluster-autoscaling coverage to the kind inference recipes, and removes dead deployment phase plumbing from the inference workflow.
  • Updates recipe invariant and deployment-order guard tests for the new h100-kind-training-kubeflow path and the expanded kind inference/training conformance checks.
  • Threads goreleaser_version through the remaining composite actions and workflows still installing GoReleaser (setup-build-tools, go-build-release, integration, kwok-test, qualification, on-tag, kwok-recipes) and hardens the action contracts so callers must pass the version explicitly from load-versions instead of silently falling back to a hardcoded default.
  • Adds a runtime guard in setup-build-tools that fails fast if install_goreleaser: 'true' is requested without goreleaser_version.
  • Bumps the remaining in-PR Go references in go.mod and the validator Dockerfiles to 1.26.2.
  • Wraps ko build with a 15-minute hard timeout (KO_BUILD_TIMEOUT env var) in install-karpenter-kwok.sh and adds actions/cache for both ~/go/pkg/mod and ~/.cache/go-build, keyed by runner.os, Go version, and Karpenter version.
  • Adds id: validate-conformance to the validation step in both GPU workflows and gates the evidence collection step on !cancelled() && steps.bundle-install.outcome == 'success' && (steps.validate-conformance.outcome == 'success' || steps.validate-conformance.outcome == 'failure'), so evidence is only collected when validation actually executed.
  • Keeps the existing kai-scheduler 20m Helm timeout, but gives that component a lower retry budget and prints focused kai-scheduler job/pod/event diagnostics before retry/fail so slow hook failures surface actionable signal quickly instead of consuming most of the GPU job wall clock.
  • Updates .github/actions/README.md examples to match the explicit goreleaser_version contract.
  • Follow-up work remains separate for two non-blocking symmetry gaps: normalizing Chainsaw assertion ownership/layout so the Kind GPU suites no longer depend on cluster/*, and adding a training-side workload smoke path to mirror the inference-only Dynamo smoke test.
  • Follow-up cleanup is still expected separately for stale docs outside the actions README.
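
As a hedged illustration of the evidence-collection gating described above (the `run` commands and the script names are assumptions; only the step ids and the `if` expression come from this PR):

```yaml
- id: bundle-install
  run: ./scripts/install-bundle.sh        # hypothetical helper name
- id: validate-conformance
  run: aicr validate --phase conformance
- name: Collect conformance evidence
  if: >-
    !cancelled() &&
    steps.bundle-install.outcome == 'success' &&
    (steps.validate-conformance.outcome == 'success' ||
     steps.validate-conformance.outcome == 'failure')
  run: ./scripts/collect-evidence.sh      # hypothetical helper name
```

Unlike the previous `always() && steps.bundle-install.outcome == 'success'` guard, this only fires when the validation step actually ran to an outcome, so evidence is never collected against a dead cluster.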

Testing

```shell
bash -n kwok/scripts/install-karpenter-kwok.sh
yamllint \
  .github/workflows/gpu-h100-training-test.yaml \
  .github/workflows/gpu-h100-inference-test.yaml \
  .github/actions/install-karpenter-kwok/action.yml \
  .github/actions/setup-build-tools/action.yml \
  .github/actions/go-build-release/action.yml \
  .github/actions/integration/action.yml \
  .github/actions/kwok-test/action.yml
git diff --check origin/main...HEAD
go test ./pkg/recipe -run 'TestConformanceRecipeInvariants|TestDeploymentOrderGuards'
go test ./pkg/bundler/deployer/helm
```
  • Before the final GoReleaser contract hardening update, the rebased PR had green tests / Test, tests / Lint, tests / CLI E2E, tests / E2E, and tests / Security Scan checks.
  • The current delta after that point is limited to action YAML, workflow path-gate updates, and the actions docs, plus the kai-scheduler deploy-script retry/diagnostics hardening. CI on this updated PR is the signal for those wired call paths.

Risk Assessment

  • Medium — Touches multiple components or has broader impact (selected)
  • High — Breaking change, affects critical paths, or complex rollout
  • Low — Isolated change, well-tested, easy to revert

Rollout notes: This changes the inference GPU job to require H100 x2, assumes the corresponding runner class remains available, keeps the remaining Go/GoReleaser references aligned to 1.26.2 / v2.15.3, now enforces explicit GoReleaser version plumbing for the affected action wrappers, and reduces kai-scheduler retry amplification in generated bundle deploy scripts. The ko build timeout assumes GNU timeout (available on Linux GitHub Actions runners; macOS local runs may need gtimeout).

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 yuanchen8911 requested review from a team as code owners April 15, 2026 00:49
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from 61e2e09 to c78e996 on April 15, 2026 01:17
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from c78e996 to 4085ee2 on April 15, 2026 01:39
@github-actions github-actions bot added size/L and removed size/XL labels Apr 15, 2026
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch 2 times, most recently from 81349f8 to 1eea17c on April 15, 2026 14:13
@github-actions github-actions bot added size/XL and removed size/L labels Apr 15, 2026
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch 6 times, most recently from d044606 to 0a1601d on April 15, 2026 15:40
@yuanchen8911 yuanchen8911 changed the title fix(ci): make GPU training and inference symmetric fix(ci): align GPU workflows and harden build tooling Apr 15, 2026
@yuanchen8911 yuanchen8911 requested review from mchmarny and xdu31 April 15, 2026 15:43
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from 0a1601d to ec98233 on April 15, 2026 15:50
@github-actions github-actions bot added size/L and removed size/XL labels Apr 15, 2026
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from ec98233 to d336108 on April 15, 2026 17:02
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from 93e034b to 97d46b4 on April 15, 2026 17:22
@yuanchen8911 yuanchen8911 requested a review from lockwobr April 15, 2026 17:33
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from a93d218 to f28a66d on April 15, 2026 17:47
@github-actions github-actions bot added size/XL and removed size/L labels Apr 15, 2026
@yuanchen8911 yuanchen8911 changed the title fix(ci): align GPU workflows and harden build tooling refactor(ci): unify GPU training and inference workflows Apr 15, 2026
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from f28a66d to 76ef0a8 on April 15, 2026 17:56
@github-actions github-actions bot added size/L and removed size/XL labels Apr 15, 2026
Member

@mchmarny mchmarny left a comment


Good work tightening CI reliability — the evidence collection guard, ko build timeout, and GoReleaser version pinning from .settings.yaml are all real improvements.

Requesting changes for:

  1. Script injection in setup-build-tools — ${{ inputs.goreleaser_version }} interpolated directly into bash is attacker-reachable via a PR-modified .settings.yaml. Use an env var instead.

  2. dump_component_helm_diagnostics naming — function name is generic but behavior is kai-scheduler-only. Either scope the name or make it generic.

  3. helm_retry silent signature change — new third positional arg could break callers silently. Needs a contract comment at minimum.

  4. Inference path-filter maintenance — 18+ individual file paths vs training's directory globs. Fragile to maintain.

Scope concern: This bundles 6 independent changes (GoReleaser pinning, GPU workflow restructure, Karpenter build hardening, kai-scheduler retry reduction, Kubeflow training coverage, chainsaw directory reorg). Each is individually revertable. Bundling means a revert of e.g. the kai-scheduler retry change also reverts the GoReleaser pinning. Consider splitting — especially since the PR self-describes as "medium risk."

Comment thread .github/actions/setup-build-tools/action.yml
Comment thread pkg/bundler/deployer/helm/templates/deploy.sh.tmpl Outdated
Comment thread pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
Comment thread .github/workflows/gpu-h100-inference-test.yaml
Comment thread .github/workflows/gpu-h100-inference-test.yaml
@github-actions github-actions bot added size/XL and removed size/L labels Apr 15, 2026
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch 2 times, most recently from 394cce3 to c3d4c67 on April 15, 2026 18:10
@yuanchen8911
Contributor Author

yuanchen8911 commented Apr 15, 2026

Review Comment Resolution

| # | Comment | Status |
|---|---------|--------|
| 1 | Script injection — `${{ inputs.goreleaser_version }}` interpolated into bash | Fixed — now uses `env: GORELEASER_VERSION` and references `${GORELEASER_VERSION}` in the shell |
| 2 | Generic function name — `dump_component_helm_diagnostics` | Fixed — renamed to `dump_kai_scheduler_helm_diagnostics` |
| 3 | `helm_retry` silent signature change — needs contract comment | Fixed — added an explicit contract comment documenting `helm_retry "<desc>" "<namespace>" "<max_retries>" <command> [args...]` |
| 4 | Inference path-filter maintenance — 18+ individual paths vs directory globs | Not addressed — the suggestion was to use `tests/chainsaw/ai-conformance/cluster/**` instead of enumerating files. Appears intentional: inference only triggers on inference-specific assertion files, not all cluster assertions. |
| 5 | Scope concern — bundling 6 changes | Not addressed — editorial feedback, not a code change request |
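
The fix for item 1 follows the standard GitHub Actions pattern of passing untrusted input through `env` rather than interpolating it into the script body. A minimal sketch — the step name, guard message, and install command are assumptions, not the exact action:

```yaml
- name: Install GoReleaser
  shell: bash
  env:
    GORELEASER_VERSION: ${{ inputs.goreleaser_version }}
  run: |
    # Expansion happens in the shell, not via template interpolation,
    # so a malicious version string cannot inject script text.
    if [ -z "${GORELEASER_VERSION}" ]; then
      echo "install_goreleaser: goreleaser_version is required" >&2
      exit 1
    fi
    go install "github.com/goreleaser/goreleaser/v2@${GORELEASER_VERSION}"
```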

@yuanchen8911 yuanchen8911 requested a review from mchmarny April 15, 2026 18:18
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from c3d4c67 to e1d258d on April 15, 2026 18:35
mchmarny
mchmarny previously approved these changes Apr 15, 2026
Member

@mchmarny mchmarny left a comment


/lgtm

@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch 4 times, most recently from f3f3c0e to 14f4388 on April 15, 2026 20:34
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from 14f4388 to 3964d03 on April 15, 2026 20:37
@yuanchen8911 yuanchen8911 requested a review from mchmarny April 15, 2026 20:48
@mchmarny mchmarny merged commit 066a7ee into NVIDIA:main Apr 15, 2026
53 checks passed
