fix(bundler): replace placeholder GitRepository with ArtifactGenerator for Flux OCI#1017
Conversation
…chart HelmReleases referenced a placeholder GitRepository that stalls under OCI consumption. Emit ArtifactGenerator + ExternalArtifact pairs via spec.chartRef instead, add replace/trunc to manifest template funcs for helm.sh/chart label sanitization, and enable full HelmRelease reconciliation in the KWOK flux-oci lane. Signed-off-by: Christopher Haar <christopher.haar@upbound.io>
This comment was marked as resolved.
This comment was marked as resolved.
There was a problem hiding this comment.
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@kwok/scripts/install-infra.sh`:
- Around line 475-487: The patch currently always appends
--feature-gates=ExternalArtifact=true which causes repeated churn; modify the
block so it first checks the helm-controller deployment args (using kc get
deployment helm-controller -n "${FLUX_NAMESPACE}" -o json and inspecting
.spec.template.spec.containers[0].args or via jq/jsonpath) for the exact flag
and only run kc patch if that arg is missing; keep existing
log_info/log_error/dump_flux_diagnostics behavior but only call kc rollout
status (and exit on failure) when you actually performed the patch, otherwise
treat it as a no-op and skip patch+rollout.
- Around line 446-447: The install_flux() routine currently calls the flux
install command without passing the cluster context, so when KUBECTL_CONTEXT is
set subsequent kc()/rollout checks target a different cluster; update the flux
install invocation (the flux install --version="${version}"
--components-extra=source-watcher call) to include the context flag by adding
--context "${KUBECTL_CONTEXT}" when KUBECTL_CONTEXT is non-empty so Flux is
installed into the same kube context used by kc() and later checks.
In `@kwok/scripts/validate-scheduling.sh`:
- Line 1481: The script now defaults KWOK_FLUX_SYNC_TIMEOUT to 500 seconds
(local deadline="${KWOK_FLUX_SYNC_TIMEOUT:-500}") but the header docs still
state 300s; update the documentation block that documents KWOK_FLUX_SYNC_TIMEOUT
to show the correct default of 500s (and adjust any example or descriptive text
to match the new deadline) so the documented default matches the variable
fallback.
- Around line 1582-1595: The current artifactgenerator check suppresses kubectl
errors and treats failures as "no ArtifactGenerators"; change the logic around
the kubectl call in validate-scheduling.sh so non-zero kubectl exits are handled
as errors: run kubectl get artifactgenerator -A -o json into ag_json but capture
its exit status, and if it failed (non-zero) log an error via log_error
including the kubectl failure output or exit code and exit/return non-zero (do
not proceed to treat as success); keep the existing handling of ag_total and
ag_not_ready when kubectl succeeds (variables: ag_json, ag_total, ag_not_ready).
In `@pkg/bundler/deployer/flux/flux_test.go`:
- Around line 1154-1160: The test calls collectHelmSources(refs, false,
"flux-system") and collectHelmSources(refs, true, "flux-system") but never
asserts that the returned source objects carry the passed namespace; add
assertions after each call to verify every returned source has Namespace ==
"flux-system" (e.g., iterate returned sources and t.Errorf/fatal if any
source.Namespace != "flux-system"). Apply the same additional assertions to the
similar block around the later test (the block at 1167-1180) so both
vendoring=false and vendoring=true cases validate namespace propagation from the
collectHelmSources call.
In `@pkg/cli/bundle.go`:
- Around line 197-204: The code currently calls cmd.String(...) for flux flags
only in the early conditional branches so config-provided values from
cfg.Bundle().Resolve() get bypassed; fix by registering the flags into temporary
variables (e.g., ociSourceNameFlag, fluxNamespaceFlag) or always registering
them, then after calling cfg.Bundle().Resolve() compute opts.ociSourceName and
opts.fluxNamespace using the usual CLI-over-config precedence (use the CLI flag
value if non-empty, otherwise use the value from cfg.Bundle().Resolve());
reference opts.deployer and opts.ociRef to keep the same conditional logic for
when these fields are relevant but perform the final assignment to
opts.ociSourceName and opts.fluxNamespace only after resolving the bundle
config.
In `@recipes/components/gpu-operator/manifests/kernel-module-params.yaml`:
- Line 31: The helm label value currently truncates only .Chart.Version
(helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" | trunc 63
}}), which can exceed Kubernetes' 63-char limit once .Chart.Name is prepended;
change the template to build the full "name-version" string first and then apply
replace/trunc to the combined value (i.e., format/printf the concatenation of
.Chart.Name and .Chart.Version, ensure "+" is replaced with "_" on the version,
then pipe the entire result through | trunc 63) so the final helm.sh/chart label
is guaranteed under 63 characters.
In `@recipes/components/network-operator/manifests/nfd-network-rule.yaml`:
- Line 38: The helm.sh/chart label currently only truncates .Chart.Version;
change the template that builds the label (the line producing helm.sh/chart
using .Chart.Name and .Chart.Version) to concatenate "< .Chart.Name >-<
.Chart.Version >", then run replace "+" "_" on the whole string, apply trunc 63,
and trim any trailing "-" so the final value never exceeds Kubernetes label
length; also update any related tests/manifests that expect the old untruncated
behavior to match the new sanitized/truncated label.
In `@tools/setup-tools`:
- Around line 445-449: The installer currently downloads and extracts the Flux
tarball directly (see FLUX_URL, FLUX_VERSION, FLUX_TMP, verify_download_url and
flux); change the flow to first download the tarball to a temporary file and
also obtain a pinned checksum (from a trusted source or bundled mapping keyed by
FLUX_VERSION), then verify the tarball's integrity with sha256sum (or openssl
dgst) before extracting: if checksum verification fails, abort and remove temp
files; only on successful verification proceed to extract into FLUX_TMP, mv the
flux binary to /usr/local/bin/flux and chmod +x. Ensure verify_download_url
still runs but do not extract until the checksum check passes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 23a18d56-f9d3-48d1-bff0-ce6645f283da
📒 Files selected for processing (30)
docs/contributor/kwok-testing.mddocs/user/cli-reference.mdkwok/scripts/install-infra.shkwok/scripts/validate-scheduling.shpkg/bundler/bundler.gopkg/bundler/config/config.gopkg/bundler/config/config_test.gopkg/bundler/deployer/flux/doc.gopkg/bundler/deployer/flux/flux.gopkg/bundler/deployer/flux/flux_test.gopkg/bundler/deployer/flux/helm.gopkg/bundler/deployer/flux/sources.gopkg/bundler/deployer/flux/templates/README.md.tmplpkg/bundler/deployer/flux/templates/artifactgenerator.yaml.tmplpkg/bundler/deployer/flux/templates/configmap-values.yaml.tmplpkg/bundler/deployer/flux/templates/gitrepo-source.yaml.tmplpkg/bundler/deployer/flux/templates/helmrelease-chartref.yaml.tmplpkg/bundler/deployer/flux/templates/helmrelease.yaml.tmplpkg/bundler/deployer/flux/templates/helmrepo-source.yaml.tmplpkg/cli/bundle.gopkg/manifest/render.gopkg/manifest/render_test.gorecipes/components/gpu-operator/manifests/kernel-module-params.yamlrecipes/components/network-operator/manifests/ib-node-config-aks.yamlrecipes/components/network-operator/manifests/nfd-network-rule.yamlrecipes/components/network-operator/manifests/nic-cluster-policy-aks.yamlrecipes/components/nodewright-customizations/manifests/no-op.yamlrecipes/components/nodewright-customizations/manifests/tuning-gke.yamlrecipes/components/nodewright-customizations/manifests/tuning.yamltools/setup-tools
|
🌿 Preview your docs: https://nvidia-preview-fix-flux-oci-artifactgenerator.docs.buildwithfern.com/aicr |
There was a problem hiding this comment.
The ArtifactGenerator + ExternalArtifact shape via spec.chartRef is the right way to make local-chart HelmReleases reconcile under Flux OCI consumption, and the KWOK lane now actually validates HelmRelease Ready=True (was previously deferred).
Few net-new items that CodeRabbit has not already covered. All medium/low — none blocking:
- Chart label fix applies to 5 more manifests beyond the 2 CodeRabbit flagged (same
replace+truncpattern acrossnodewright-customizations/manifests/*.yamlandnetwork-operator/manifests/{ib,nic,nfd}-*.yaml). - Code duplication between the OCI-mode emit blocks in
generateManifestHelmChartandgenerateVendoredHelmComponent— worth extracting a helper before more deployer changes drift the two paths. - Kind control-plane node selection picks
.items[0]and labels/taints exactly one node. Comment says "the only real node" but isn't enforced — fragile if anyone runs against a multi-node Kind setup. --flux-*flags visible in--helpfor all deployers but only honored when--deployer flux.
Signed-off-by: Christopher Haar <christopher.haar@upbound.io>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
kwok/scripts/validate-scheduling.sh (1)
1582-1588:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon’t mix stderr into the JSON payload before
jqparsing.
kubectl ... -o json 2>&1can inject warnings intoag_json, makingjqparsing fail even when the command succeeds.Suggested fix
- ag_json=$(kubectl get artifactgenerator -A -o json 2>&1) && ag_rc=0 || ag_rc=$? + if ag_json=$(kubectl get artifactgenerator -A -o json 2>/dev/null); then + ag_rc=0 + else + ag_rc=$? + fi if (( ag_rc != 0 )); then - log_error "kubectl get artifactgenerator failed (rc=${ag_rc}): ${ag_json}" + log_error "kubectl get artifactgenerator failed (rc=${ag_rc})" + kubectl get artifactgenerator -A 2>&1 | tail -20 || true + dump_flux_failures return 1 fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kwok/scripts/validate-scheduling.sh` around lines 1582 - 1588, The command currently mixes stderr into ag_json by using "kubectl get artifactgenerator -A -o json 2>&1", which can corrupt the JSON for jq; change it so stdout is captured into ag_json and stderr into a separate variable (e.g., run kubectl without "2>&1" and capture stderr separately into ag_err), keep the existing ag_rc check, and on failure log ag_err instead of parsing ag_json; then use ag_json only for jq (references: variables ag_json, ag_rc, ag_total and the kubectl get artifactgenerator -A -o json command).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/setup-tools`:
- Around line 452-456: The script currently skips verification when
FLUX_EXPECTED is empty; instead, change the behavior so that if FLUX_EXPECTED is
not set the install aborts: replace the log_warning branch with an error and
non-zero exit (e.g., call log_error or echo an error then exit 1) so that when
FLUX_EXPECTED is empty the code does not proceed to install, and keep the
verify_sha256 call using FLUX_TMP/FLUX_TAR when FLUX_EXPECTED is present.
---
Duplicate comments:
In `@kwok/scripts/validate-scheduling.sh`:
- Around line 1582-1588: The command currently mixes stderr into ag_json by
using "kubectl get artifactgenerator -A -o json 2>&1", which can corrupt the
JSON for jq; change it so stdout is captured into ag_json and stderr into a
separate variable (e.g., run kubectl without "2>&1" and capture stderr
separately into ag_err), keep the existing ag_rc check, and on failure log
ag_err instead of parsing ag_json; then use ag_json only for jq (references:
variables ag_json, ag_rc, ag_total and the kubectl get artifactgenerator -A -o
json command).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: a4ce0181-7d1e-4593-8f88-a2b14091d3c8
📒 Files selected for processing (19)
.github/actions/kwok-test/action.yml.github/actions/load-versions/action.yml.github/actions/setup-build-tools/action.yml.github/workflows/kwok-recipes.yamlkwok/scripts/install-infra.shkwok/scripts/validate-scheduling.shpkg/bundler/deployer/flux/flux_test.gopkg/bundler/deployer/flux/helm.gopkg/cli/bundle.gopkg/manifest/render.gopkg/manifest/render_test.gorecipes/components/gpu-operator/manifests/kernel-module-params.yamlrecipes/components/network-operator/manifests/ib-node-config-aks.yamlrecipes/components/network-operator/manifests/nfd-network-rule.yamlrecipes/components/network-operator/manifests/nic-cluster-policy-aks.yamlrecipes/components/nodewright-customizations/manifests/no-op.yamlrecipes/components/nodewright-customizations/manifests/tuning-gke.yamlrecipes/components/nodewright-customizations/manifests/tuning.yamltools/setup-tools
Signed-off-by: Christopher Haar <christopher.haar@upbound.io>
Signed-off-by: Christopher Haar <christopher.haar@upbound.io>
|
@mchmarny after you updated the branch - github incident jumped in - can you rerun the CI ? - thanks |
Summary
Local-chart
HelmReleasesemitted by the Flux deployer referenced a placeholderGitRepository(https://github.com/YOUR_ORG/YOUR_REPO.git) that stalls under OCI consumption becausespec.chart.spec.sourceRefrejectsOCIRepository. This PR switches local-chart components to emitArtifactGenerator+ExternalArtifactpairs viaspec.chartRef,enabling full HelmRelease reconciliation in the KWOK flux-oci lane.Motivation / Context
When consuming AICR bundles via Flux OCI,
helm-controllercannot fetch local charts from aGitRepositorythat points at a placeholder URL. Every local-chart HelmRelease stalls, and thedependsOnchain cascades the failure through the entire bundle.Fixes: #964
Related: N/A
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)kwok/scripts/)Implementation Notes
Bundler changes (pkg/bundler/deployer/flux/):
New ArtifactGenerator CR template (
artifactgenerator.yaml.tmpl) extracts each local chart's subdirectory from the outerOCIRepositoryinto anExternalArtifact.New
helmrelease-chartref.yaml.tmpltemplate emitsHelmReleaseswithspec.chartRefpointing at theExternalArtifactinstead ofspec.chart.spec.sourceRefpointing at aGitRepository.When
OCISourceNameis set (via--flux-oci-source-name), local-chart paths (manifest-only, pre-manifests, vendored wrappers) all route through the ArtifactGenerator path. When empty, the existing GitRepository code path is preserved.All Flux CR templates now accept an explicit Namespace field instead of hardcoding flux-system.
CLI changes (
pkg/cli/bundle.go):New
--flux-oci-source-nameflag (default: aicr-bundle) sets theOCIRepositoryCR name thatArtifactGeneratorsreference.New
--flux-namespaceflag (default: flux-system) sets the namespace for all generated Flux CRs.Both flags are wired through
config.WithOCISourceName/config.WithFluxNamespace.KWOK test changes (
kwok/scripts/):install-infra.sh: Switched from raw manifest apply to flux install -
-components-extra=source-watcher, patches helm-controller with--feature-gates=ExternalArtifact=true, and waits forArtifactGenerator/ExternalArtifactCRDs.validate-scheduling.sh: wait_for_flux_sync now validates full HelmRelease reconciliation (
Ready=True) andArtifactGeneratorreadiness instead of only checking bundle application. Deadline bumped from 300s to 500s to accommodate the additional reconciliation steps. dump_flux_failures extended withArtifactGenerator,ExternalArtifact, and source-watcher diagnostics. Kind control-plane node is excluded from bundle DaemonSets to prevent CrashLoops on non-cloud nodes.tools/setup-tools: Added flux CLI installation.
One issue discovered during testing:
Flux appends + to the chart version (e.g.
0.1.0+09d01b0b4d7d), and the+character is invalid in Kubernetes label values. This breaks any manifest template that useshelm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version }}. This is a known Flux/Helm interaction issue documented in fluxcd/flux2#2868 (comment). Fixed by sanitizing the label with{{ .Chart.Version | replace "+" "_" | trunc 63 }}and addingreplace/truncto the bundler's manifest template engine.KWOK Kind control-plane node exclusion:
In the KWOK test lane, the Kind control-plane node (the only real node) must be excluded from bundle DaemonSets, otherwise pods like
ebs-csi-nodeandnfd-workeractually start on it, CrashLoop because cloud services are unavailable, and stall the entire HelmRelease dependsOn chain. Rather than skipping or manipulating HelmRelease readiness checks, we exclude the node properly:Label
eks.amazonaws.com/compute-type=hybrid , the ebs-csi-driver DaemonSet has a NotIn [fargate, auto, hybrid] nodeAffinity, so this prevents scheduling. Taintnfd-excluded=true:NoSchedule`, NFD worker respects this and skips the node.Testing
# Commands run (prefer `make qualify` for non-trivial changes) make qualifyArtifactGeneratorand chartRef HelmRelease generationRisk Assessment
Rollout notes:
The new
ArtifactGeneratorpath only activates when--flux-oci-source-nameis set (which happens automatically when --deployer flux + OCI output). Git-based Flux deployments are unaffected. Target clusters require Flux v2.7+ withsource-watcherdeployed andExternalArtifact=truefeature gate onhelm-controller.Checklist
make testwith-race)make lint)git commit -S) — GPG signing info