
refactor(ci): unify GPU training and inference workflows#579

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:codex/gpu-ci-symmetry
Apr 15, 2026

Conversation

@yuanchen8911
Contributor

@yuanchen8911 yuanchen8911 commented Apr 15, 2026

Summary

Align the H100 training and inference GPU workflows around leaf-only conformance validation. Add Kubeflow training coverage for kind/H100, and remove the orphaned non-Kubeflow kind training Chainsaw directory in favor of the Kubeflow-specific suite. Harden the Karpenter KWOK build/evidence-collection path, make GoReleaser version selection explicit across the remaining composite actions and workflows so callers cannot drift from .settings.yaml, and fail fast on repeated kai-scheduler hook timeouts with targeted diagnostics.

Motivation / Context

This is the follow-up to the GPU conformance dedup work in #577.
It removes the remaining drift between training and inference GPU CI so both workflows follow the same core model: trigger on the deployed leaf recipe's real inputs, install the leaf recipe, run leaf health/resource checks, run aicr validate --phase conformance, and collect a single evidence set. Platform-specific coverage still differs where intended (robust-controller / Kubeflow on training, gateway + Dynamo smoke test on inference).

Additionally, the GPU training test run on this PR revealed that the ko build of the Karpenter KWOK provider can hang for ~47 minutes on a cold Go build cache, consuming the entire job timeout. The conformance evidence step then ran against a dead Kind cluster because its guard (always() && steps.bundle-install.outcome == 'success') did not check whether the validation step had actually executed. This PR adds a hard timeout on ko build, caches both Go module and build artifacts for Karpenter, and tightens the evidence collection guard.
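
The hard-timeout behavior can be sketched with GNU `timeout`. This is a minimal demo, not the actual `install-karpenter-kwok.sh`: a slow `sleep` stands in for a hung `ko build`, and the demo budget is in seconds where the workflow uses 15 minutes.

```shell
#!/usr/bin/env bash
# Hedged sketch of the hard-timeout wrapper; the real script wraps `ko build`.
# `sleep 10` is a stand-in for a build that hangs past the budget.
KO_BUILD_TIMEOUT="${KO_BUILD_TIMEOUT:-2}"   # demo: 2 seconds; the workflow uses 15m
if timeout "${KO_BUILD_TIMEOUT}" sleep 10; then
  build_status="finished"
else
  rc=$?
  # GNU timeout exits 124 when the deadline expires
  build_status="timed out (exit ${rc})"
fi
echo "build ${build_status}"
```

On macOS the same behavior needs `gtimeout` from coreutils, as noted in the rollout notes.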

A later GPU training run on this branch also showed that kai-scheduler can spend excessive wall-clock time in Helm retries on cold runners. The per-attempt 20m timeout is acceptable, but the global retry budget amplifies a single slow hook into a large CI time sink. This PR now caps kai-scheduler to a lower retry budget and emits focused hook diagnostics before retry/fail.
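
A bounded retry with focused diagnostics can be sketched as a shell function. The names and the `helm_retry "<desc>" "<namespace>" "<max_retries>" <command> [args...]` contract mirror what was discussed in review, but this is an illustrative sketch, not the generated deploy script; the diagnostics function body is a hypothetical stand-in.

```shell
#!/usr/bin/env bash
# Hedged sketch of a bounded Helm retry. The diagnostics function is a
# placeholder for the kai-scheduler job/pod/event dump described in the PR.
dump_kai_scheduler_helm_diagnostics() {
  echo "diagnostics for namespace $1 would be dumped here"
}

# Contract: helm_retry "<desc>" "<namespace>" "<max_retries>" <command> [args...]
helm_retry() {
  local desc="$1" namespace="$2" max_retries="$3"
  shift 3
  local attempt=1
  until "$@"; do
    dump_kai_scheduler_helm_diagnostics "${namespace}"
    if [ "${attempt}" -ge "${max_retries}" ]; then
      echo "${desc}: giving up after ${attempt} attempt(s)" >&2
      return 1
    fi
    attempt=$((attempt + 1))
  done
  echo "${desc}: succeeded on attempt ${attempt}"
}

# Demo: a command that always fails, with a retry budget of 2.
if helm_retry "kai-scheduler install" "kai-scheduler" 2 false; then
  result="unexpected success"
else
  result="failed fast within retry budget"
fi
echo "${result}"
```

With a budget of 2, a persistently failing hook surfaces diagnostics twice and fails fast instead of consuming the job's wall clock.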

This branch has been rebased on merged PR #580, so the Linux E2E tools/setup-tools fix is already in the base and is no longer part of this PR.

Fixes: N/A
Related: #554, #577, #580

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: GPU CI workflows, chainsaw coverage, Karpenter KWOK build, and build/release tooling wiring

Implementation Notes

  • Adds recipes/overlays/h100-kind-training-kubeflow.yaml and updates the training workflow to install platform: kubeflow.
  • Replaces the orphaned tests/chainsaw/ai-conformance/kind-training/ directory, renaming it to kind-training-kubeflow and updating the training workflow and resource assertions to consume only the Kubeflow-specific suite.
  • Aligns the H100 training and inference workflows on the same leaf-only validation model. Training now validates h100-kind-training-kubeflow; inference validates h100-kind-inference-dynamo. The inference Chainsaw leaf suite is renamed from kind/ to kind-inference-dynamo/, reusable Kind-only assertions now live under kind-common/, and both GPU workflows path-gate on a dedicated leaf suite directory plus the shared Kind assertion directory instead of broad catch-all globs.
  • Extends training conformance coverage with robust-controller and secure-accelerator-access, and adds Kubeflow chainsaw/assert coverage for the trainer controller, webhook, and TrainJob CRD.
  • Moves inference to H100 x2, adds gang-scheduling and cluster-autoscaling coverage to the kind inference recipes, and removes dead deployment phase plumbing from the inference workflow.
  • Updates recipe invariant and deployment-order guard tests for the new h100-kind-training-kubeflow path and the expanded kind inference/training conformance checks.
  • Threads goreleaser_version through the remaining composite actions and workflows still installing GoReleaser (setup-build-tools, go-build-release, integration, kwok-test, qualification, on-tag, kwok-recipes) and hardens the action contracts so callers must pass the version explicitly from load-versions instead of silently falling back to a hardcoded default.
  • Adds a runtime guard in setup-build-tools that fails fast if install_goreleaser: 'true' is requested without goreleaser_version.
  • Bumps the remaining in-PR Go references in go.mod and the validator Dockerfiles to 1.26.2.
  • Wraps ko build with a 15-minute hard timeout (KO_BUILD_TIMEOUT env var) in install-karpenter-kwok.sh and adds actions/cache for both ~/go/pkg/mod and ~/.cache/go-build, keyed by runner.os, Go version, and Karpenter version.
  • Adds id: validate-conformance to the validation step in both GPU workflows and gates the evidence collection step on !cancelled() && steps.bundle-install.outcome == 'success' && (steps.validate-conformance.outcome == 'success' || steps.validate-conformance.outcome == 'failure'), so evidence is only collected when validation actually executed.
  • Keeps the existing kai-scheduler 20m Helm timeout, but gives that component a lower retry budget and prints focused kai-scheduler job/pod/event diagnostics before retry/fail so slow hook failures surface actionable signal quickly instead of consuming most of the GPU job wall clock.
  • Updates .github/actions/README.md examples to match the explicit goreleaser_version contract.
  • Follow-up work remains separate for two non-blocking symmetry gaps: normalizing Chainsaw assertion ownership/layout so the Kind GPU suites no longer depend on cluster/*, and adding a training-side workload smoke path to mirror the inference-only Dynamo smoke test.
  • Follow-up cleanup is still expected separately for stale docs outside the actions README.
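
As a hedged illustration of the evidence-collection gating described above (the `run` commands and the script names are assumptions; only the step ids and the `if` expression come from this PR):

```yaml
- id: bundle-install
  run: ./scripts/install-bundle.sh        # hypothetical helper name
- id: validate-conformance
  run: aicr validate --phase conformance
- name: Collect conformance evidence
  if: >-
    !cancelled() &&
    steps.bundle-install.outcome == 'success' &&
    (steps.validate-conformance.outcome == 'success' ||
     steps.validate-conformance.outcome == 'failure')
  run: ./scripts/collect-evidence.sh      # hypothetical helper name
```

Unlike the previous `always() && steps.bundle-install.outcome == 'success'` guard, this only fires when the validation step actually ran to an outcome, so evidence is never collected against a dead cluster.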

Testing

```shell
bash -n kwok/scripts/install-karpenter-kwok.sh
yamllint \
  .github/workflows/gpu-h100-training-test.yaml \
  .github/workflows/gpu-h100-inference-test.yaml \
  .github/actions/install-karpenter-kwok/action.yml \
  .github/actions/setup-build-tools/action.yml \
  .github/actions/go-build-release/action.yml \
  .github/actions/integration/action.yml \
  .github/actions/kwok-test/action.yml
git diff --check origin/main...HEAD
go test ./pkg/recipe -run 'TestConformanceRecipeInvariants|TestDeploymentOrderGuards'
go test ./pkg/bundler/deployer/helm
```
  • Before the final GoReleaser contract hardening update, the rebased PR had green tests / Test, tests / Lint, tests / CLI E2E, tests / E2E, and tests / Security Scan checks.
  • The current delta after that point is limited to action YAML, workflow path-gate updates, and the actions docs, plus the kai-scheduler deploy-script retry/diagnostics hardening. CI on this updated PR is the signal for those wired call paths.

Risk Assessment

  • Medium — Touches multiple components or has broader impact (selected)
  • High — Breaking change, affects critical paths, or complex rollout
  • Low — Isolated change, well-tested, easy to revert

Rollout notes: This changes the inference GPU job to require H100 x2, assumes the corresponding runner class remains available, keeps the remaining Go/GoReleaser references aligned to 1.26.2 / v2.15.3, now enforces explicit GoReleaser version plumbing for the affected action wrappers, and reduces kai-scheduler retry amplification in generated bundle deploy scripts. The ko build timeout assumes GNU timeout (available on Linux GitHub Actions runners; macOS local runs may need gtimeout).

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 yuanchen8911 requested review from a team as code owners April 15, 2026 00:49
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from 61e2e09 to c78e996 on April 15, 2026 01:17
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from c78e996 to 4085ee2 on April 15, 2026 01:39
@github-actions github-actions bot added size/L and removed size/XL labels Apr 15, 2026
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch 2 times, most recently from 81349f8 to 1eea17c on April 15, 2026 14:13
@github-actions github-actions bot added size/XL and removed size/L labels Apr 15, 2026
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch 6 times, most recently from d044606 to 0a1601d on April 15, 2026 15:40
@yuanchen8911 yuanchen8911 changed the title fix(ci): make GPU training and inference symmetric fix(ci): align GPU workflows and harden build tooling Apr 15, 2026
@yuanchen8911 yuanchen8911 requested review from mchmarny and xdu31 April 15, 2026 15:43
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from 0a1601d to ec98233 on April 15, 2026 15:50
@github-actions github-actions bot added size/L and removed size/XL labels Apr 15, 2026
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from ec98233 to d336108 on April 15, 2026 17:02
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from 93e034b to 97d46b4 on April 15, 2026 17:22
@yuanchen8911 yuanchen8911 requested a review from lockwobr April 15, 2026 17:33
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from a93d218 to f28a66d on April 15, 2026 17:47
@github-actions github-actions bot added size/XL and removed size/L labels Apr 15, 2026
@yuanchen8911 yuanchen8911 changed the title fix(ci): align GPU workflows and harden build tooling refactor(ci): unify GPU training and inference workflows Apr 15, 2026
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from f28a66d to 76ef0a8 on April 15, 2026 17:56
@github-actions github-actions bot added size/L and removed size/XL labels Apr 15, 2026
Member

@mchmarny mchmarny left a comment


Good work tightening CI reliability — the evidence collection guard, ko build timeout, and GoReleaser version pinning from .settings.yaml are all real improvements.

Requesting changes for:

  1. Script injection in setup-build-tools — ${{ inputs.goreleaser_version }} interpolated directly into bash is attacker-reachable via a PR-modified .settings.yaml. Use an env var instead.

  2. dump_component_helm_diagnostics naming — function name is generic but behavior is kai-scheduler-only. Either scope the name or make it generic.

  3. helm_retry silent signature change — new third positional arg could break callers silently. Needs a contract comment at minimum.

  4. Inference path-filter maintenance — 18+ individual file paths vs training's directory globs. Fragile to maintain.

Scope concern: This bundles 6 independent changes (GoReleaser pinning, GPU workflow restructure, Karpenter build hardening, kai-scheduler retry reduction, Kubeflow training coverage, chainsaw directory reorg). Each is individually revertable. Bundling means a revert of e.g. the kai-scheduler retry change also reverts the GoReleaser pinning. Consider splitting — especially since the PR self-describes as "medium risk."

Comment thread .github/actions/setup-build-tools/action.yml
Comment thread pkg/bundler/deployer/helm/templates/deploy.sh.tmpl Outdated
Comment thread pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
Comment thread .github/workflows/gpu-h100-inference-test.yaml
Comment thread .github/workflows/gpu-h100-inference-test.yaml
@github-actions github-actions bot added size/XL and removed size/L labels Apr 15, 2026
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch 2 times, most recently from 394cce3 to c3d4c67 on April 15, 2026 18:10
@yuanchen8911
Contributor Author

yuanchen8911 commented Apr 15, 2026

Review Comment Resolution

| # | Comment | Status |
|---|---------|--------|
| 1 | Script injection — `${{ inputs.goreleaser_version }}` interpolated into bash | Fixed — now uses `env: GORELEASER_VERSION` and references `${GORELEASER_VERSION}` in the shell |
| 2 | Generic function name — `dump_component_helm_diagnostics` | Fixed — renamed to `dump_kai_scheduler_helm_diagnostics` |
| 3 | `helm_retry` silent signature change — needs contract comment | Fixed — added an explicit contract comment documenting `helm_retry "<desc>" "<namespace>" "<max_retries>" <command> [args...]` |
| 4 | Inference path-filter maintenance — 18+ individual paths vs directory globs | Not addressed — the suggestion was to use `tests/chainsaw/ai-conformance/cluster/**` instead of enumerating files. Appears intentional: inference only triggers on inference-specific assertion files, not all cluster assertions. |
| 5 | Scope concern — bundling 6 changes | Not addressed — editorial feedback, not a code change request |
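
The fix for item 1 follows the standard GitHub Actions pattern of passing untrusted input through `env` rather than interpolating it into the script body. A minimal sketch — the step name, guard message, and install command are assumptions, not the exact action:

```yaml
- name: Install GoReleaser
  shell: bash
  env:
    GORELEASER_VERSION: ${{ inputs.goreleaser_version }}
  run: |
    # Expansion happens in the shell, not via template interpolation,
    # so a malicious version string cannot inject script text.
    if [ -z "${GORELEASER_VERSION}" ]; then
      echo "install_goreleaser: goreleaser_version is required" >&2
      exit 1
    fi
    go install "github.com/goreleaser/goreleaser/v2@${GORELEASER_VERSION}"
```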

@yuanchen8911 yuanchen8911 requested a review from mchmarny April 15, 2026 18:18
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from c3d4c67 to e1d258d on April 15, 2026 18:35
mchmarny
mchmarny previously approved these changes Apr 15, 2026
Member

@mchmarny mchmarny left a comment


/lgtm

@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch 4 times, most recently from f3f3c0e to 14f4388 on April 15, 2026 20:34
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from 14f4388 to 3964d03 on April 15, 2026 20:37
@yuanchen8911 yuanchen8911 requested a review from mchmarny April 15, 2026 20:48
@mchmarny mchmarny merged commit 066a7ee into NVIDIA:main Apr 15, 2026
53 checks passed
