feat(bundler): configurable parent Application name (--app-name) #1036
Merged
The argocd and argocd-helm deployers hardcoded the parent Argo
Application name (`nvidia-stack` and `aicr-stack`, respectively), so two
AICR bundles deployed to the same Argo CD namespace silently overwrote
each other: the second bundle replaced the first parent Application and
orphaned the first bundle's children. This blocks the multi-bundle
topology intended for clusters that consume more than one recipe.
This change adds a new `appName` plumbing surface that flows through
the bundler config:
- CLI: `aicr bundle --app-name <name>`
- API: `POST /v1/bundle?app-name=<name>`
- Config: `spec.bundle.deployment.appName`
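As an illustration of the API surface, here is a minimal Go sketch that builds the request URL with the `app-name` query parameter. The base URL and the `bundleURL` helper are hypothetical; only the `/v1/bundle` path and the `app-name` parameter come from this PR, and any other parameters the endpoint may require are left out.

```go
package main

import (
	"fmt"
	"net/url"
)

// bundleURL is a hypothetical helper: it builds the POST target for the
// bundle endpoint and adds app-name only when a name was supplied, so an
// empty name leaves the deployer default in effect.
func bundleURL(base, appName string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/v1/bundle"
	q := u.Query()
	if appName != "" {
		q.Set("app-name", appName)
	}
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// The host below is an assumption for the example.
	target, err := bundleURL("http://localhost:8080", "ops-runtime")
	if err != nil {
		panic(err)
	}
	fmt.Println("POST", target)
}
```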
Defaults (`aicr-stack` for argocd-helm, `nvidia-stack` for argocd) are
preserved for backward compatibility. The name is validated as a
DNS-1123 subdomain at the parse boundary, so an invalid value fails fast
(an error from `aicr bundle`, a 400 from the API) rather than at
apiserver admission time. The flag is rejected on `--deployer helm` /
`--deployer flux` so a misplaced flag fails loudly.
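The validation and deployer gating described above can be sketched roughly as follows. This is not the PR's `ValidateAppName`; `checkAppName` is a hypothetical stand-in that uses only the standard library, and the pattern is the one Kubernetes documents for DNS-1123 subdomains (lowercase alphanumerics, `-` and `.`, at most 253 characters).

```go
package main

import (
	"fmt"
	"regexp"
)

// dns1123Subdomain follows the documented Kubernetes pattern: labels of
// lowercase alphanumerics and '-', separated by '.', each label starting
// and ending with an alphanumeric character.
var dns1123Subdomain = regexp.MustCompile(
	`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

const maxSubdomainLen = 253

// checkAppName is a hypothetical helper, not the PR's implementation.
// An empty name is accepted and means "use the deployer's default".
func checkAppName(deployer, name string) error {
	if name == "" {
		return nil
	}
	switch deployer {
	case "argocd", "argocd-helm":
		// Only the Argo CD deployers have a parent Application to name.
	default:
		return fmt.Errorf("--app-name is not supported by deployer %q", deployer)
	}
	if len(name) > maxSubdomainLen || !dns1123Subdomain.MatchString(name) {
		return fmt.Errorf("invalid app name %q: must be a DNS-1123 subdomain", name)
	}
	return nil
}

func main() {
	cases := []struct{ deployer, name string }{
		{"argocd", "ops-runtime"},
		{"argocd-helm", "Ops_Runtime"},
		{"helm", "ops-runtime"},
		{"flux", ""},
	}
	for _, c := range cases {
		fmt.Printf("%-12s %-14q -> %v\n", c.deployer, c.name, checkAppName(c.deployer, c.name))
	}
}
```

Rejecting the name here, before any manifest is rendered, is what keeps the failure out of apiserver admission.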
For `--deployer argocd-helm`, the bundle-time value is written into
the chart's root values.yaml and the parent App template reads from
`{{ .Values.appName | default "aicr-stack" | quote }}`, so an operator
can still override at install time:
    helm install gpu-stack . --set appName=ops-runtime ...
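The resulting precedence for argocd-helm (install-time `--set`, then the bundle-time value in values.yaml, then the template default) can be modeled in a few lines of Go. `resolveAppName` is a hypothetical helper written for this illustration; in the real chart the layering is done by Helm's value merging and the template's `default` call.

```go
package main

import "fmt"

// templateDefault matches the fallback in the parent App template.
const templateDefault = "aicr-stack"

// resolveAppName is a hypothetical model of the three layers:
// --set overrides beat the chart's values.yaml, which beats the
// template's built-in default. Empty values count as unset.
func resolveAppName(chartValues, setOverrides map[string]string) string {
	if v := setOverrides["appName"]; v != "" {
		return v
	}
	if v := chartValues["appName"]; v != "" {
		return v
	}
	return templateDefault
}

func main() {
	bundled := map[string]string{"appName": "gpu-stack"}
	override := map[string]string{"appName": "ops-runtime"}

	fmt.Println(resolveAppName(nil, nil))          // neither layer sets a name
	fmt.Println(resolveAppName(bundled, nil))      // bundle-time value only
	fmt.Println(resolveAppName(nil, override))     // install-time value only
	fmt.Println(resolveAppName(bundled, override)) // install-time wins
}
```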
For `--deployer argocd`, the value is baked into the rendered
app-of-apps.yaml — that deployer materializes a static manifest, not a
chart, so the choice cannot be deferred to apply time. The generated
README's `argocd app get/sync` examples are templated to match.
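To show why the argocd deployer has to fix the name at bundle time, here is a small sketch that renders a static manifest and the matching README commands from the same value using `text/template`. The template text, the `appOfAppsData` struct, and `render` are simplified assumptions for this example, not the bundler's actual templates.

```go
package main

import (
	"os"
	"strings"
	"text/template"
)

// Simplified stand-ins for the rendered artifacts (assumed shapes).
const appOfAppsTmpl = `apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: {{ .AppName }}
`

const readmeTmpl = `argocd app get {{ .AppName }}
argocd app sync {{ .AppName }}
`

// appOfAppsData is a hypothetical data struct for this sketch.
type appOfAppsData struct{ AppName string }

// render bakes the name into the output; once written to disk there is
// no values file left to override it at apply time.
func render(tmplText, appName string) (string, error) {
	if appName == "" {
		appName = "nvidia-stack" // argocd deployer default
	}
	t, err := template.New("static").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	if err := t.Execute(&b, appOfAppsData{AppName: appName}); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	for _, tmplText := range []string{appOfAppsTmpl, readmeTmpl} {
		out, err := render(tmplText, "ops-runtime")
		if err != nil {
			panic(err)
		}
		os.Stdout.WriteString(out)
	}
}
```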
Out of scope (per #1011 thread):
- Flux template parent-Kustomization name (separate doc-only fix)
- argocd-helm chart name `aicr-bundle` (already fixed in #1032)
Fixes #1011
Coverage Report ✅
Merging this branch changes the coverage (1 decrease, 5 increase).
Note: the "Total", "Covered", and "Missed" counts refer to code statements, not lines of code. The value in brackets is the test coverage of that file in the old version of the code.
CodeRabbit review on #1036:
1. The cli-reference.md `--app-name` entry omitted `helmfile` from the "rejected on" list. Corrected to mention all non-Argo deployers (matches the API docs' "Rejected on other deployers" wording).
2. `argocd.Generator.Generate` / `argocdhelm.Generator.Generate` now validate AppName at the deployer boundary via `bundlercfg.ValidateAppName`. Direct library callers (bypassing the CLI/API validation layer) can no longer ship a manifest whose name would only be rejected at apiserver admission. Empty still resolves to the deployer's default. New `TestGenerate_AppNameValidatedAtBoundary` pins this in both deployers.
3. `--app-name` was missing from runBundleCmd's `validateSingleValueFlags` list. Added so that repeated `--app-name first --app-name second` is rejected. A new chainsaw step in tests/chainsaw/cli/duplicate-flags exercises the integration path end-to-end.
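The duplicate-flag check in item 3 can be approximated as below. This is a hedged sketch, not the repository's `validateSingleValueFlags`: `checkSingleValueFlags` is a hypothetical helper that counts occurrences in the raw argument list and handles both the `--flag value` and `--flag=value` forms. It does not try to distinguish a flag from a value that happens to look like one.

```go
package main

import (
	"fmt"
	"strings"
)

// checkSingleValueFlags is a hypothetical helper: it returns an error if
// any flag named in single appears more than once in args.
func checkSingleValueFlags(args []string, single ...string) error {
	seen := map[string]int{}
	for _, a := range args {
		if a == "--" {
			break // everything after "--" is positional
		}
		name, _, _ := strings.Cut(a, "=") // "--app-name=x" counts as "--app-name"
		seen[name]++
	}
	for _, f := range single {
		if seen[f] > 1 {
			return fmt.Errorf("flag %s specified %d times; it accepts a single value", f, seen[f])
		}
	}
	return nil
}

func main() {
	args := []string{"bundle", "--app-name", "first", "--app-name", "second"}
	fmt.Println(checkSingleValueFlags(args, "--app-name"))
}
```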
yuanchen8911 approved these changes May 26, 2026
lalitadithya approved these changes May 26, 2026
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 26, 2026
The argocd-helm-oci wrapper script was passing the FULL bundle URL to `helm install --set repoURL=…` (including the per-recipe chart name at the end). That matched the pre-PR-NVIDIA#1032 contract where the parent Application's `source.chart` was hardcoded to `aicr-bundle`. PR NVIDIA#1032 (and NVIDIA#1035's reinforcement) changed the parent App template to expect the parent-namespace-only repoURL and to append .Chart.Name itself via the separate `source.chart` field. The wrapper script wasn't updated to match.
Result on every PR with argocd-helm-oci Tier-1 KWOK coverage: the parent App resolves to `oci://registry.aicr-registry.svc.cluster.local:5000/aicr/<recipe>/<recipe>:<tag>`, the OCI artifact lookup 404s, gpu-operator-post's Application can never sync, and the whole stack times out on `GitOps sync timeout strike 1/3`.
The failure was masked on `main` because the most-recent KWOK Cluster Validation run on `main` (#26469449378 at 0d3e62d, success) ran *before* PR NVIDIA#1035 merged. After NVIDIA#1035 / NVIDIA#1036 / NVIDIA#1038 all landed on main, no fresh KWOK run has triggered on `main` yet, but the next one will fail the same way every open PR's argocd-helm-oci Tier-1 jobs are currently failing.
Fix is a one-line drop of the per-recipe suffix from OCI_IN_CLUSTER_REF in the argocd-helm-oci branch of generate_bundle. The flux branch keeps the per-recipe suffix because flux's OCIRepository CR consumes the FULL artifact URL (recipe segment included). Updated the surrounding comment to point at the post-NVIDIA#1032 contract so the next reader understands the asymmetry.
End-to-end check (verified from PR NVIDIA#1030's debug artifact at b3f2296): repo-server log shows `registry.aicr-registry.svc.cluster.local:5000/aicr/<recipe>/<recipe>:<tag>: not found`, caused by the same double-append. With the recipe suffix dropped, Argo's resolution `<repoURL>/<chart>:<tag>` aligns with the pushed artifact at `oci://…/aicr/<recipe>:<tag>`.
Refs PR NVIDIA#1030 (where this surfaced), PR NVIDIA#1032 (contract change), PR NVIDIA#1035 (parent App template enforcement).
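The double-append described in that commit message is easy to reproduce in isolation. The sketch below assumes the `<repoURL>/<chart>:<tag>` resolution quoted above; `resolveChartRef`, the recipe name `example-recipe`, and the tag `v0.0.0` are placeholders invented for the illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveChartRef models the resolution quoted in the commit message:
// the chart name is appended to repoURL, followed by the tag.
func resolveChartRef(repoURL, chart, tag string) string {
	return strings.TrimSuffix(repoURL, "/") + "/" + chart + ":" + tag
}

func main() {
	const (
		parent = "oci://registry.aicr-registry.svc.cluster.local:5000/aicr"
		recipe = "example-recipe" // hypothetical recipe name
		tag    = "v0.0.0"         // hypothetical tag
	)
	pushed := parent + "/" + recipe + ":" + tag

	// Before the fix: repoURL already ends in the recipe segment.
	before := resolveChartRef(parent+"/"+recipe, recipe, tag)
	// After the fix: repoURL is the parent namespace only.
	after := resolveChartRef(parent, recipe, tag)

	fmt.Println("pushed:", pushed)
	fmt.Println("before:", before, "match:", before == pushed)
	fmt.Println("after: ", after, "match:", after == pushed)
}
```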
Summary
Operators can now deploy multiple non-overlapping AICR bundles to the same Argo CD namespace by passing a distinct `--app-name` (or `app-name` query parameter) per bundle. The flag flows through CLI, API, and config-file surfaces and is honored by both the `argocd` and `argocd-helm` deployers.
Motivation / Context
The `argocd` and `argocd-helm` deployers hardcoded the parent Argo Application name (`nvidia-stack` and `aicr-stack`). Two bundles installed into the same Argo CD namespace would each create a parent Application with the same name: the second overwrote the first, orphaning every child Application that the first had been managing. This blocks the multi-bundle topology intended for clusters that consume more than one recipe.
Fixes: #1011
Type of Change
Component(s) Affected
- `cmd/aicr`, `pkg/cli`
- `cmd/aicrd`, `pkg/api`, `pkg/server`
- `pkg/bundler`, `pkg/component/*`
- `docs/`, `examples/`
Implementation Notes
Plumbing: `appName` flows through `pkg/bundler/config.Config` (private field, `AppName()` getter, `WithAppName` option) and `pkg/config/DeploymentSpec.AppName` (YAML/JSON `appName`). `BundleSpec.Resolve` validates the value as a DNS-1123 subdomain at the wire-to-typed boundary.
Validation: `config.ValidateAppName` (in `pkg/bundler/config`) returns `ErrCodeInvalidRequest` for non-DNS-1123 input. Empty is allowed and means "use the deployer's default." CLI/API both call it before constructing the bundler config.
Deployer behavior:
- `argocd-helm`: parent App template reads `{{ .Values.appName | default "aicr-stack" | quote }}`. When `--app-name` is supplied, the value is written into the bundle's root `values.yaml` so it is the chart default; `helm install --set appName=...` still wins. This matches the URL-portable contract that argocd-helm already follows for `repoURL` / `targetRevision`.
- `argocd`: the value is threaded into `AppOfAppsData.AppName` and `ReadmeData.AppName`, baked into the rendered `app-of-apps.yaml` and the README's `argocd app get/sync` examples at bundle time. This deployer materializes a static manifest, not a chart, so the choice cannot be deferred to apply time.
Strictness: `--app-name` / `app-name` is rejected on `--deployer helm` and `--deployer flux` (CLI returns `ErrCodeInvalidRequest`; API returns 400). Silently accepting it would mislead operators expecting their flag to take effect.
Template filename: kept fixed at `templates/aicr-stack.yaml` for argocd-helm even when `--app-name` is overridden. The rendered `metadata.name` is dynamic via `.Values.appName`; the filename is bundle-internal and stable across install-time `--set` overrides.
Out of scope (per #1011 thread):
- Flux template parent-Kustomization name (separate doc-only fix)
- argocd-helm chart name `aicr-bundle` (already fixed in fix(bundler): derive argocd-helm chart name from OCI artifact path #1032)
Testing
    make qualify # tests + lint + e2e + scan + license headers; all passed
New tests:
- `config.TestWithAppName`, `config.TestValidateAppName`: option round-trip + DNS-1123 validation table
- `argocdhelm.TestHelmTemplate_AppNameOverride`: live `helm template` covering default, bundle-time `AppName`, install-time `--set appName`, and bundle-time + install-time-override combo
- `argocd.TestGenerate_AppName`: default vs explicit `AppName` flows into both `app-of-apps.yaml` and README
- `cli.TestParseBundleCmdOptions_AppName`: flag parsing across argocd, argocd-helm, helm, flux deployers + invalid DNS rejection
- `bundler.TestBundleRequestAppNameParam`: API query-param accepted on argocd/argocd-helm, 400 on helm + invalid names
- `.Values.appName` template line
Risk Assessment
Defaults are preserved: `aicr-stack` / `nvidia-stack` still appear in the rendered output for any caller that omits `--app-name`. The flag is gated by deployer kind so a typo on a helm/flux bundle fails loudly. Easy to revert.
Rollout notes: No migration required. Existing recipes and config files continue to bundle identically. Adopters with two bundles in one Argo CD namespace can now set distinct `--app-name` values.
Checklist
- `make test` with `-race`
- `make lint`
- `git commit -S`