refactor(ci): unify GPU training and inference workflows #579
mchmarny merged 1 commit into NVIDIA:main
mchmarny
left a comment
Good work tightening CI reliability — the evidence collection guard, ko build timeout, and GoReleaser version pinning from .settings.yaml are all real improvements.
Requesting changes for:
- Script injection in `setup-build-tools`: `${{ inputs.goreleaser_version }}` interpolated directly into bash is attacker-reachable via a PR-modified `.settings.yaml`. Use an env var instead.
- `dump_component_helm_diagnostics` naming: the function name is generic but the behavior is kai-scheduler-only. Either scope the name or make it generic.
- `helm_retry` silent signature change: the new third positional arg could break callers silently. Needs a contract comment at minimum.
- Inference path-filter maintenance: 18+ individual file paths vs training's directory globs. Fragile to maintain.
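As an illustration of the env-var suggestion in the first item, here is a minimal bash sketch. It assumes the composite action step maps the input into an environment variable through its `env:` block (named `GORELEASER_VERSION` here, a hypothetical name) instead of interpolating `${{ inputs.goreleaser_version }}` into the script body. The `validate_version` and `check_goreleaser_version` helpers and the accepted tag pattern are also assumptions for illustration, not the action's actual code.

```shell
#!/usr/bin/env bash
# Sketch only: GORELEASER_VERSION is assumed to arrive via the step's `env:` block,
# so the value reaches bash as data and is never parsed as script text.

# Hypothetical helper: accept only tags shaped like v2.15.3.
validate_version() {
  local version="$1"
  [[ "$version" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]]
}

# Hypothetical entry point: fail fast on a missing or malformed version.
check_goreleaser_version() {
  if ! validate_version "${GORELEASER_VERSION:-}"; then
    echo "invalid or missing goreleaser_version" >&2
    return 1
  fi
  echo "would install GoReleaser ${GORELEASER_VERSION}"
}
```

Because the value is always expanded inside double quotes, a string such as `v2.15.3; rm -rf /` is rejected by the pattern check rather than executed.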
Scope concern: This bundles 6 independent changes (GoReleaser pinning, GPU workflow restructure, Karpenter build hardening, kai-scheduler retry reduction, Kubeflow training coverage, chainsaw directory reorg). Each is individually revertable. Bundling means a revert of e.g. the kai-scheduler retry change also reverts the GoReleaser pinning. Consider splitting — especially since the PR self-describes as "medium risk."
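On the `helm_retry` point above, a contract comment could look roughly like the following sketch. The real `helm_retry` signature is not shown in this conversation, so `retry_with_budget`, its argument order, and its return codes are hypothetical stand-ins. The idea is to make the retry budget a required, validated argument so that a caller cannot silently inherit a different default.

```shell
#!/usr/bin/env bash
# Sketch only: retry_with_budget is a hypothetical stand-in, not the real helm_retry.
#
# Contract:
#   retry_with_budget <max_attempts> <diagnostics_fn> <command> [args...]
#   - max_attempts:   positive integer; required, so no caller inherits a hidden default
#   - diagnostics_fn: function or command run after every failed attempt
#   - returns 0 on the first success, 1 once the budget is exhausted, 2 on bad arguments
retry_with_budget() {
  if [[ $# -lt 3 || ! "$1" =~ ^[1-9][0-9]*$ ]]; then
    echo "usage: retry_with_budget <max_attempts> <diagnostics_fn> <command> [args...]" >&2
    return 2
  fi
  local max_attempts="$1" diagnostics_fn="$2"
  shift 2
  local attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      return 0
    fi
    echo "attempt ${attempt}/${max_attempts} failed: $*" >&2
    # Diagnostics are best effort and must not mask the original failure.
    "$diagnostics_fn" || true
  done
  return 1
}
```

With this shape, a slow component can be given a smaller budget at the call site, and a caller that omits the budget gets an explicit usage error instead of different retry behavior.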
Review Comment Resolution
Summary
Align the H100 training and inference GPU workflows around leaf-only conformance validation, add Kubeflow training coverage for kind/H100, remove the orphaned non-Kubeflow kind training Chainsaw directory in favor of the Kubeflow-specific suite, harden the Karpenter KWOK build/evidence collection path, make GoReleaser version selection explicit across the remaining composite actions and workflows so callers cannot drift from `.settings.yaml`, and fail fast on repeated `kai-scheduler` hook timeouts with targeted diagnostics.

Motivation / Context
This is the follow-up to the GPU conformance dedup work in #577.
It removes the remaining drift between training and inference GPU CI so both workflows follow the same core model: trigger on the deployed leaf recipe's real inputs, install the leaf recipe, run leaf health/resource checks, run `aicr validate --phase conformance`, and collect a single evidence set. Platform-specific coverage still differs where intended (`robust-controller` / Kubeflow on training, gateway + Dynamo smoke test on inference).

Additionally, the GPU training test run on this PR revealed that the `ko build` of the Karpenter KWOK provider can hang for ~47 minutes on a cold Go build cache, consuming the entire job timeout. The conformance evidence step then ran against a dead Kind cluster because its guard (`always() && steps.bundle-install.outcome == 'success'`) did not check whether the validation step had actually executed. This PR adds a hard timeout on `ko build`, caches both Go module and build artifacts for Karpenter, and tightens the evidence collection guard.

A later GPU training run on this branch also showed that `kai-scheduler` can spend excessive wall-clock time in Helm retries on cold runners. The per-attempt `20m` timeout is acceptable, but the global retry budget amplifies a single slow hook into a large CI time sink. This PR now caps `kai-scheduler` to a lower retry budget and emits focused hook diagnostics before retry/fail.

This branch has been rebased on merged PR #580, so the Linux E2E `tools/setup-tools` fix is already in the base and is no longer part of this PR.

Fixes: N/A
Related: #554, #577, #580
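
The `ko build` hard timeout described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `install-karpenter-kwok.sh`: `find_timeout_cmd` and `run_with_timeout` are hypothetical helper names, and the fallback to `gtimeout` reflects the name GNU coreutils typically gets when installed on macOS.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from install-karpenter-kwok.sh.

# Prefer GNU timeout; fall back to gtimeout where coreutils is installed under that name.
find_timeout_cmd() {
  if command -v timeout >/dev/null 2>&1; then
    echo timeout
  elif command -v gtimeout >/dev/null 2>&1; then
    echo gtimeout
  else
    return 1
  fi
}

# Run a command under a hard time limit. GNU timeout exits with 124 when the limit is hit.
run_with_timeout() {
  local limit="$1"
  shift
  local timeout_cmd
  timeout_cmd="$(find_timeout_cmd)" || { echo "no timeout binary found" >&2; return 127; }
  "$timeout_cmd" "$limit" "$@"
}

# Intended use, assuming KO_BUILD_TIMEOUT defaults to 15 minutes:
#   run_with_timeout "${KO_BUILD_TIMEOUT:-15m}" ko build <importpath>
```

A caller can treat exit code 124 as "build hung" and fail the job immediately instead of letting the build consume the whole job timeout.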
Type of Change
Component(s) Affected
- `cmd/aicr`, `pkg/cli`
- `cmd/aicrd`, `pkg/api`, `pkg/server`
- `pkg/recipe`
- `pkg/bundler`, `pkg/component/*`
- `pkg/collector`, `pkg/snapshotter`
- `pkg/validator`
- `pkg/errors`, `pkg/k8s`
- `docs/`, `examples/`

Implementation Notes
- Adds `recipes/overlays/h100-kind-training-kubeflow.yaml` and updates the training workflow to install `platform: kubeflow`.
- Removes the orphaned `tests/chainsaw/ai-conformance/kind-training/` directory by renaming it to `kind-training-kubeflow` and updating the training workflow and resource assertions to consume only the Kubeflow-specific suite.
- Training validates `h100-kind-training-kubeflow`; inference validates `h100-kind-inference-dynamo`. The inference Chainsaw leaf suite is renamed from `kind/` to `kind-inference-dynamo/`, reusable Kind-only assertions now live under `kind-common/`, and both GPU workflows path-gate on a dedicated leaf suite directory plus the shared Kind assertion directory instead of broad catch-all globs.
- [...] `robust-controller` and `secure-accelerator-access`, and adds Kubeflow chainsaw/assert coverage for the trainer controller, webhook, and `TrainJob` CRD.
- Changes the inference GPU job to require `H100 x2`, adds `gang-scheduling` and `cluster-autoscaling` coverage to the kind inference recipes, and removes dead `deployment` phase plumbing from the inference workflow.
- [...] `h100-kind-training-kubeflow` path and the expanded kind inference/training conformance checks.
- Passes `goreleaser_version` through the remaining composite actions and workflows still installing GoReleaser (`setup-build-tools`, `go-build-release`, `integration`, `kwok-test`, `qualification`, `on-tag`, `kwok-recipes`) and hardens the action contracts so callers must pass the version explicitly from `load-versions` instead of silently falling back to a hardcoded default.
- [...] `setup-build-tools` that fails fast if `install_goreleaser: 'true'` is requested without `goreleaser_version`.
- [...] `go.mod` and the validator Dockerfiles to `1.26.2`.
- Wraps `ko build` with a 15-minute hard timeout (`KO_BUILD_TIMEOUT` env var) in `install-karpenter-kwok.sh` and adds `actions/cache` for both `~/go/pkg/mod` and `~/.cache/go-build`, keyed by `runner.os`, Go version, and Karpenter version.
- Adds `id: validate-conformance` to the validation step in both GPU workflows and gates the evidence collection step on `!cancelled() && steps.bundle-install.outcome == 'success' && (steps.validate-conformance.outcome == 'success' || steps.validate-conformance.outcome == 'failure')`, so evidence is only collected when validation actually executed.
- Keeps the `kai-scheduler` `20m` Helm timeout, but gives that component a lower retry budget and prints focused `kai-scheduler` job/pod/event diagnostics before retry/fail so slow hook failures surface actionable signal quickly instead of consuming most of the GPU job wall clock.
- Updates the `.github/actions/README.md` examples to match the explicit `goreleaser_version` contract.
- [...] `cluster/*`, and adding a training-side workload smoke path to mirror the inference-only Dynamo smoke test.

Testing
- [...] `tests / Test`, `tests / Lint`, `tests / CLI E2E`, `tests / E2E`, and `tests / Security Scan` checks.
- [...] `kai-scheduler` deploy-script retry/diagnostics hardening. CI on this updated PR is the signal for those wired call paths.

Risk Assessment
Rollout notes: This changes the inference GPU job to require `H100 x2`, assumes the corresponding runner class remains available, keeps the remaining Go/GoReleaser references aligned to `1.26.2`/`v2.15.3`, now enforces explicit GoReleaser version plumbing for the affected action wrappers, and reduces `kai-scheduler` retry amplification in generated bundle deploy scripts. The `ko build` timeout assumes GNU `timeout` (available on Linux GitHub Actions runners; macOS local runs may need `gtimeout`).

Checklist
- `make test` with `-race`
- `make lint`
- `git commit -S` (GPG signing info)