Skip to content

fix(bundler): replace placeholder GitRepository with ArtifactGenerator for Flux OCI#1017

Merged
mchmarny merged 6 commits into
NVIDIA:mainfrom
haarchri:fix/flux-oci-artifactgenerator
May 26, 2026
Merged

fix(bundler): replace placeholder GitRepository with ArtifactGenerator for Flux OCI#1017
mchmarny merged 6 commits into
NVIDIA:mainfrom
haarchri:fix/flux-oci-artifactgenerator

Conversation

@haarchri
Copy link
Copy Markdown
Contributor

Summary

Local-chart HelmReleases emitted by the Flux deployer referenced a placeholder GitRepository (https://github.com/YOUR_ORG/YOUR_REPO.git) that stalls under OCI consumption because spec.chart.spec.sourceRef rejects OCIRepository. This PR switches local-chart components to emit ArtifactGenerator + ExternalArtifact pairs via spec.chartRef, enabling full HelmRelease reconciliation in the KWOK flux-oci lane.

Motivation / Context

When consuming AICR bundles via Flux OCI, helm-controller cannot fetch local charts from a GitRepository that points at a placeholder URL. Every local-chart HelmRelease stalls, and the dependsOn chain cascades the failure through the entire bundle.

Fixes: #964
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: KWOK test infrastructure (kwok/scripts/)

Implementation Notes

Bundler changes (pkg/bundler/deployer/flux/):

New ArtifactGenerator CR template (artifactgenerator.yaml.tmpl) extracts each local chart's subdirectory from the outer OCIRepository into an ExternalArtifact.
New helmrelease-chartref.yaml.tmpl template emits HelmReleases with spec.chartRef pointing at the ExternalArtifact instead of spec.chart.spec.sourceRef pointing at a GitRepository.
When OCISourceName is set (via --flux-oci-source-name), local-chart paths (manifest-only, pre-manifests, vendored wrappers) all route through the ArtifactGenerator path. When empty, the existing GitRepository code path is preserved.
All Flux CR templates now accept an explicit Namespace field instead of hardcoding flux-system.

CLI changes (pkg/cli/bundle.go):
New --flux-oci-source-name flag (default: aicr-bundle) sets the OCIRepository CR name that ArtifactGenerators reference.
New --flux-namespace flag (default: flux-system) sets the namespace for all generated Flux CRs.
Both flags are wired through config.WithOCISourceName / config.WithFluxNamespace.

KWOK test changes (kwok/scripts/):
install-infra.sh: Switched from raw manifest apply to flux install --components-extra=source-watcher, patches helm-controller with --feature-gates=ExternalArtifact=true, and waits for ArtifactGenerator/ExternalArtifact CRDs.
validate-scheduling.sh: wait_for_flux_sync now validates full HelmRelease reconciliation (Ready=True) and ArtifactGenerator readiness instead of only checking bundle application. Deadline bumped from 300s to 500s to accommodate the additional reconciliation steps. dump_flux_failures extended with ArtifactGenerator, ExternalArtifact, and source-watcher diagnostics. Kind control-plane node is excluded from bundle DaemonSets to prevent CrashLoops on non-cloud nodes.
tools/setup-tools: Added flux CLI installation.

One issue discovered during testing:

Flux appends + to the chart version (e.g. 0.1.0+09d01b0b4d7d), and the + character is invalid in Kubernetes label values. This breaks any manifest template that uses helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version }}. This is a known Flux/Helm interaction issue documented in fluxcd/flux2#2868 (comment). Fixed by sanitizing the label with {{ .Chart.Version | replace "+" "_" | trunc 63 }} and adding replace/trunc to the bundler's manifest template engine.

KWOK Kind control-plane node exclusion:
In the KWOK test lane, the Kind control-plane node (the only real node) must be excluded from bundle DaemonSets, otherwise pods like ebs-csi-node and nfd-worker actually start on it, CrashLoop because cloud services are unavailable, and stall the entire HelmRelease dependsOn chain. Rather than skipping or manipulating HelmRelease readiness checks, we exclude the node properly:

Label eks.amazonaws.com/compute-type=hybrid , the ebs-csi-driver DaemonSet has a NotIn [fargate, auto, hybrid] nodeAffinity, so this prevents scheduling. Taint nfd-excluded=true:NoSchedule`, NFD worker respects this and skips the node.

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify
unset GITLAB_TOKEN && make kwok-test-deployer RECIPE=gb200-eks-training DEPLOYER=flux-oci 2>&1
[...]
[INFO] ========================================
[INFO] Results
[INFO] ========================================
  ✓ gb200-eks-training
[INFO] Cleaning up for next test...
node "gpu-0" force deleted
node "gpu-1" force deleted
node "gpu-2" force deleted
node "gpu-3" force deleted
node "system-0" force deleted
node "system-1" force deleted
No resources found
[INFO] All 1 recipe(s) passed!
[INFO] Cluster preserved. Delete with: kind delete cluster --name aicr-kwok-test
  • Unit tests added for ArtifactGenerator and chartRef HelmRelease generation
  • Unit tests added for replace/trunc template funcs
  • Unit tests added for OCISourceName/FluxNamespace config options
  • KWOK flux-oci lane now validates full HelmRelease reconciliation (Ready=True) and ArtifactGenerator readiness

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:
The new ArtifactGenerator path only activates when --flux-oci-source-name is set (which happens automatically when --deployer flux + OCI output). Git-based Flux deployments are unaffected. Target clusters require Flux v2.7+ with source-watcher deployed and ExternalArtifact=true feature gate on helm-controller.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

…chart HelmReleases referenced a placeholder GitRepository that stalls under OCI consumption. Emit ArtifactGenerator + ExternalArtifact pairs via spec.chartRef instead, add replace/trunc to manifest template funcs for helm.sh/chart label sanitization, and enable full HelmRelease reconciliation in the KWOK flux-oci lane.

Signed-off-by: Christopher Haar <christopher.haar@upbound.io>
@haarchri haarchri requested review from a team as code owners May 24, 2026 11:39
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@haarchri haarchri changed the title fix(bundler): use ArtifactGenerator for Flux OCI local-charts, Local-… fix(bundler): replace placeholder GitRepository with ArtifactGenerator for Flux OCI May 24, 2026
@coderabbitai

This comment was marked as resolved.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@kwok/scripts/install-infra.sh`:
- Around line 475-487: The patch currently always appends
--feature-gates=ExternalArtifact=true which causes repeated churn; modify the
block so it first checks the helm-controller deployment args (using kc get
deployment helm-controller -n "${FLUX_NAMESPACE}" -o json and inspecting
.spec.template.spec.containers[0].args or via jq/jsonpath) for the exact flag
and only run kc patch if that arg is missing; keep existing
log_info/log_error/dump_flux_diagnostics behavior but only call kc rollout
status (and exit on failure) when you actually performed the patch, otherwise
treat it as a no-op and skip patch+rollout.
- Around line 446-447: The install_flux() routine currently calls the flux
install command without passing the cluster context, so when KUBECTL_CONTEXT is
set subsequent kc()/rollout checks target a different cluster; update the flux
install invocation (the flux install --version="${version}"
--components-extra=source-watcher call) to include the context flag by adding
--context "${KUBECTL_CONTEXT}" when KUBECTL_CONTEXT is non-empty so Flux is
installed into the same kube context used by kc() and later checks.

In `@kwok/scripts/validate-scheduling.sh`:
- Line 1481: The script now defaults KWOK_FLUX_SYNC_TIMEOUT to 500 seconds
(local deadline="${KWOK_FLUX_SYNC_TIMEOUT:-500}") but the header docs still
state 300s; update the documentation block that documents KWOK_FLUX_SYNC_TIMEOUT
to show the correct default of 500s (and adjust any example or descriptive text
to match the new deadline) so the documented default matches the variable
fallback.
- Around line 1582-1595: The current artifactgenerator check suppresses kubectl
errors and treats failures as "no ArtifactGenerators"; change the logic around
the kubectl call in validate-scheduling.sh so non-zero kubectl exits are handled
as errors: run kubectl get artifactgenerator -A -o json into ag_json but capture
its exit status, and if it failed (non-zero) log an error via log_error
including the kubectl failure output or exit code and exit/return non-zero (do
not proceed to treat as success); keep the existing handling of ag_total and
ag_not_ready when kubectl succeeds (variables: ag_json, ag_total, ag_not_ready).

In `@pkg/bundler/deployer/flux/flux_test.go`:
- Around line 1154-1160: The test calls collectHelmSources(refs, false,
"flux-system") and collectHelmSources(refs, true, "flux-system") but never
asserts that the returned source objects carry the passed namespace; add
assertions after each call to verify every returned source has Namespace ==
"flux-system" (e.g., iterate returned sources and t.Errorf/fatal if any
source.Namespace != "flux-system"). Apply the same additional assertions to the
similar block around the later test (the block at 1167-1180) so both
vendoring=false and vendoring=true cases validate namespace propagation from the
collectHelmSources call.

In `@pkg/cli/bundle.go`:
- Around line 197-204: The code currently calls cmd.String(...) for flux flags
only in the early conditional branches so config-provided values from
cfg.Bundle().Resolve() get bypassed; fix by registering the flags into temporary
variables (e.g., ociSourceNameFlag, fluxNamespaceFlag) or always registering
them, then after calling cfg.Bundle().Resolve() compute opts.ociSourceName and
opts.fluxNamespace using the usual CLI-over-config precedence (use the CLI flag
value if non-empty, otherwise use the value from cfg.Bundle().Resolve());
reference opts.deployer and opts.ociRef to keep the same conditional logic for
when these fields are relevant but perform the final assignment to
opts.ociSourceName and opts.fluxNamespace only after resolving the bundle
config.

In `@recipes/components/gpu-operator/manifests/kernel-module-params.yaml`:
- Line 31: The helm label value currently truncates only .Chart.Version
(helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" | trunc 63
}}), which can exceed Kubernetes' 63-char limit once .Chart.Name is prepended;
change the template to build the full "name-version" string first and then apply
replace/trunc to the combined value (i.e., format/printf the concatenation of
.Chart.Name and .Chart.Version, ensure "+" is replaced with "_" on the version,
then pipe the entire result through | trunc 63) so the final helm.sh/chart label
is guaranteed under 63 characters.

In `@recipes/components/network-operator/manifests/nfd-network-rule.yaml`:
- Line 38: The helm.sh/chart label currently only truncates .Chart.Version;
change the template that builds the label (the line producing helm.sh/chart
using .Chart.Name and .Chart.Version) to concatenate "< .Chart.Name >-<
.Chart.Version >", then run replace "+" "_" on the whole string, apply trunc 63,
and trim any trailing "-" so the final value never exceeds Kubernetes label
length; also update any related tests/manifests that expect the old untruncated
behavior to match the new sanitized/truncated label.

In `@tools/setup-tools`:
- Around line 445-449: The installer currently downloads and extracts the Flux
tarball directly (see FLUX_URL, FLUX_VERSION, FLUX_TMP, verify_download_url and
flux); change the flow to first download the tarball to a temporary file and
also obtain a pinned checksum (from a trusted source or bundled mapping keyed by
FLUX_VERSION), then verify the tarball's integrity with sha256sum (or openssl
dgst) before extracting: if checksum verification fails, abort and remove temp
files; only on successful verification proceed to extract into FLUX_TMP, mv the
flux binary to /usr/local/bin/flux and chmod +x. Ensure verify_download_url
still runs but do not extract until the checksum check passes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 23a18d56-f9d3-48d1-bff0-ce6645f283da

📥 Commits

Reviewing files that changed from the base of the PR and between 6098765 and 9acb12b.

📒 Files selected for processing (30)
  • docs/contributor/kwok-testing.md
  • docs/user/cli-reference.md
  • kwok/scripts/install-infra.sh
  • kwok/scripts/validate-scheduling.sh
  • pkg/bundler/bundler.go
  • pkg/bundler/config/config.go
  • pkg/bundler/config/config_test.go
  • pkg/bundler/deployer/flux/doc.go
  • pkg/bundler/deployer/flux/flux.go
  • pkg/bundler/deployer/flux/flux_test.go
  • pkg/bundler/deployer/flux/helm.go
  • pkg/bundler/deployer/flux/sources.go
  • pkg/bundler/deployer/flux/templates/README.md.tmpl
  • pkg/bundler/deployer/flux/templates/artifactgenerator.yaml.tmpl
  • pkg/bundler/deployer/flux/templates/configmap-values.yaml.tmpl
  • pkg/bundler/deployer/flux/templates/gitrepo-source.yaml.tmpl
  • pkg/bundler/deployer/flux/templates/helmrelease-chartref.yaml.tmpl
  • pkg/bundler/deployer/flux/templates/helmrelease.yaml.tmpl
  • pkg/bundler/deployer/flux/templates/helmrepo-source.yaml.tmpl
  • pkg/cli/bundle.go
  • pkg/manifest/render.go
  • pkg/manifest/render_test.go
  • recipes/components/gpu-operator/manifests/kernel-module-params.yaml
  • recipes/components/network-operator/manifests/ib-node-config-aks.yaml
  • recipes/components/network-operator/manifests/nfd-network-rule.yaml
  • recipes/components/network-operator/manifests/nic-cluster-policy-aks.yaml
  • recipes/components/nodewright-customizations/manifests/no-op.yaml
  • recipes/components/nodewright-customizations/manifests/tuning-gke.yaml
  • recipes/components/nodewright-customizations/manifests/tuning.yaml
  • tools/setup-tools

Comment thread kwok/scripts/install-infra.sh Outdated
Comment thread kwok/scripts/install-infra.sh Outdated
Comment thread kwok/scripts/validate-scheduling.sh
Comment thread kwok/scripts/validate-scheduling.sh Outdated
Comment thread pkg/bundler/deployer/flux/flux_test.go
Comment thread pkg/cli/bundle.go
Comment thread recipes/components/gpu-operator/manifests/kernel-module-params.yaml Outdated
Comment thread recipes/components/network-operator/manifests/nfd-network-rule.yaml Outdated
Comment thread tools/setup-tools
@github-actions
Copy link
Copy Markdown
Contributor

Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ArtifactGenerator + ExternalArtifact shape via spec.chartRef is the right way to make local-chart HelmReleases reconcile under Flux OCI consumption, and the KWOK lane now actually validates HelmRelease Ready=True (was previously deferred).

Few net-new items that CodeRabbit has not already covered. All medium/low — none blocking:

  • Chart label fix applies to 5 more manifests beyond the 2 CodeRabbit flagged (same replace+trunc pattern across nodewright-customizations/manifests/*.yaml and network-operator/manifests/{ib,nic,nfd}-*.yaml).
  • Code duplication between the OCI-mode emit blocks in generateManifestHelmChart and generateVendoredHelmComponent — worth extracting a helper before more deployer changes drift the two paths.
  • Kind control-plane node selection picks .items[0] and labels/taints exactly one node. Comment says "the only real node" but isn't enforced — fragile if anyone runs against a multi-node Kind setup.
  • --flux-* flags visible in --help for all deployers but only honored when --deployer flux.

Comment thread pkg/bundler/deployer/flux/helm.go Outdated
Comment thread pkg/manifest/render.go
Comment thread kwok/scripts/validate-scheduling.sh Outdated
Comment thread pkg/cli/bundle.go
Signed-off-by: Christopher Haar <christopher.haar@upbound.io>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
kwok/scripts/validate-scheduling.sh (1)

1582-1588: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t mix stderr into the JSON payload before jq parsing.

kubectl ... -o json 2>&1 can inject warnings into ag_json, making jq parsing fail even when the command succeeds.

Suggested fix
-        ag_json=$(kubectl get artifactgenerator -A -o json 2>&1) && ag_rc=0 || ag_rc=$?
+        if ag_json=$(kubectl get artifactgenerator -A -o json 2>/dev/null); then
+            ag_rc=0
+        else
+            ag_rc=$?
+        fi
         if (( ag_rc != 0 )); then
-            log_error "kubectl get artifactgenerator failed (rc=${ag_rc}): ${ag_json}"
+            log_error "kubectl get artifactgenerator failed (rc=${ag_rc})"
+            kubectl get artifactgenerator -A 2>&1 | tail -20 || true
+            dump_flux_failures
             return 1
         fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kwok/scripts/validate-scheduling.sh` around lines 1582 - 1588, The command
currently mixes stderr into ag_json by using "kubectl get artifactgenerator -A
-o json 2>&1", which can corrupt the JSON for jq; change it so stdout is
captured into ag_json and stderr into a separate variable (e.g., run kubectl
without "2>&1" and capture stderr separately into ag_err), keep the existing
ag_rc check, and on failure log ag_err instead of parsing ag_json; then use
ag_json only for jq (references: variables ag_json, ag_rc, ag_total and the
kubectl get artifactgenerator -A -o json command).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/setup-tools`:
- Around line 452-456: The script currently skips verification when
FLUX_EXPECTED is empty; instead, change the behavior so that if FLUX_EXPECTED is
not set the install aborts: replace the log_warning branch with an error and
non-zero exit (e.g., call log_error or echo an error then exit 1) so that when
FLUX_EXPECTED is empty the code does not proceed to install, and keep the
verify_sha256 call using FLUX_TMP/FLUX_TAR when FLUX_EXPECTED is present.

---

Duplicate comments:
In `@kwok/scripts/validate-scheduling.sh`:
- Around line 1582-1588: The command currently mixes stderr into ag_json by
using "kubectl get artifactgenerator -A -o json 2>&1", which can corrupt the
JSON for jq; change it so stdout is captured into ag_json and stderr into a
separate variable (e.g., run kubectl without "2>&1" and capture stderr
separately into ag_err), keep the existing ag_rc check, and on failure log
ag_err instead of parsing ag_json; then use ag_json only for jq (references:
variables ag_json, ag_rc, ag_total and the kubectl get artifactgenerator -A -o
json command).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a4ce0181-7d1e-4593-8f88-a2b14091d3c8

📥 Commits

Reviewing files that changed from the base of the PR and between 9acb12b and 08b27af.

📒 Files selected for processing (19)
  • .github/actions/kwok-test/action.yml
  • .github/actions/load-versions/action.yml
  • .github/actions/setup-build-tools/action.yml
  • .github/workflows/kwok-recipes.yaml
  • kwok/scripts/install-infra.sh
  • kwok/scripts/validate-scheduling.sh
  • pkg/bundler/deployer/flux/flux_test.go
  • pkg/bundler/deployer/flux/helm.go
  • pkg/cli/bundle.go
  • pkg/manifest/render.go
  • pkg/manifest/render_test.go
  • recipes/components/gpu-operator/manifests/kernel-module-params.yaml
  • recipes/components/network-operator/manifests/ib-node-config-aks.yaml
  • recipes/components/network-operator/manifests/nfd-network-rule.yaml
  • recipes/components/network-operator/manifests/nic-cluster-policy-aks.yaml
  • recipes/components/nodewright-customizations/manifests/no-op.yaml
  • recipes/components/nodewright-customizations/manifests/tuning-gke.yaml
  • recipes/components/nodewright-customizations/manifests/tuning.yaml
  • tools/setup-tools

Comment thread tools/setup-tools
haarchri added 2 commits May 25, 2026 09:10
Signed-off-by: Christopher Haar <christopher.haar@upbound.io>
Signed-off-by: Christopher Haar <christopher.haar@upbound.io>
@haarchri haarchri requested a review from mchmarny May 25, 2026 08:34
@haarchri
Copy link
Copy Markdown
Contributor Author

@mchmarny after you updated the branch - github incident jumped in - can you rerun the CI ? - thanks

@mchmarny mchmarny enabled auto-merge (squash) May 26, 2026 15:07
@mchmarny mchmarny merged commit a578244 into NVIDIA:main May 26, 2026
98 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bundler(flux): local-chart HelmReleases unreachable under OCI consumption

2 participants