feat: os-talos mixin + bundler preManifestFiles support#856
Conversation
ManifestFiles applies after the primary chart (sync-wave N+1). PreManifestFiles applies before (sync-wave N-1) so resources the chart depends on — like a Namespace with PSS-privileged labels — can land ahead of the pods that need them. Merge semantics mirror ManifestFiles exactly: union the base and overlay lists, deduplicate by path, preserve order. Used by the os-talos mixin (follow-up commits) to attach per-component talos/namespace.yaml files in front of each privileged chart.
Introduce manifestPhase enum (pre/post) and a shared collectComponentManifestsByPhase body so adding pre-manifest support in follow-up commits doesn't fork the loader. Existing collectComponentManifests is now a thin wrapper over the post path; all current call sites stay identical. The new collectComponentPreManifests companion is unused in this commit and gets wired into the bundler entry points in the next commit.
localformat.Options, the three deployer Generators (helm, argocd, argocdhelm), and bundler.go's three call sites all now distinguish ComponentPreManifests from ComponentPostManifests. The writer keeps reading only from the post map for now — Task 4 wires the pre-injection branch — so output is unchanged because the pre map stays empty for every existing recipe (none sets PreManifestFiles). Also folds in two cleanups from the Task 2 review: - iota order swapped so phasePostManifests is the zero value (defensive: a future unspecified phase falls through to legacy post behavior, not pre). - A default arm on the switch in collectComponentManifestsByPhase returns ErrCodeInternal rather than silently empty results. - Removed the //nolint:unused on collectComponentPreManifests — it's now wired into all three deployer call sites.
Address two findings from the Task 3 code review:
* Both pre- and post-manifest collection paths used the same error
message ("failed to collect component manifests"), so a stack trace
couldn't tell which phase failed. Distinct messages now name the
phase ("...pre-manifests" / "...post-manifests").
* The six error-wrap blocks across the three deployer call sites in
buildDeployer reproduced what pkg/errors.PropagateOrWrap already
does. Collapsed each onto a single PropagateOrWrap call. stderrors
is no longer imported by bundler.go where applicable.
Pure refactor of the error-handling shape — no behavior change beyond
the error message text.
…lder Add an injectionPhase enum (pre/post) and an Options.injectAuxiliaryFolder helper used by both phases. Refactored the existing post-injection block inside KindUpstreamHelm to use the helper (byte-identical output verified via existing goldens) and added the parallel pre-injection call before the primary folder emission in every per-component iteration. The collision check is extended: a component with preManifestFiles cannot coexist with an explicitly-declared <name>-pre component (symmetric with the existing <name>-post check). The folder upper bound rises from 2*N to 3*N to account for the new third folder per component. Sync-wave allocation in argocd.go is unchanged — wave = folder index — so an emitted pre folder lands at wave N-1, primary at N, post at N+1 for free without any argocd code change. Existing recipes don't set PreManifestFiles, so all current bundles are byte-identical.
…e collision Address two optional one-line suggestions from the Task 4 code review: - Note on injectionPhase: explain why it's string-typed (folder-name + error-message embedding) given that the sibling manifestPhase in pkg/bundler/bundler.go uses int-iota. - Note on the pre collision check: spell out why there's no Repository-guard there (pre injection runs regardless of primary kind), so the asymmetry with the post block is intentional rather than missing.
TestBundleGolden_MixedWithPre constructs a single-component Generator with ComponentPreManifests set (a Namespace YAML with PSS-privileged labels) and an upstream Helm chart. Verifies the bundle emits: 001-foo-pre/ (local-helm wrapping the namespace manifest) 002-foo/ (upstream-helm primary) Locks in the invariant that the pre folder lands BEFORE the primary (opposite of -post). Pairs with the writer change in commit 1b290883 which added the pre-injection branch via injectAuxiliaryFolder. Existing 5 helm goldens (upstream_helm_only, manifest_only, mixed_gpu_operator, kai_scheduler_present, nodewright_present) remain byte-identical — pre-only scenarios don't perturb them.
…st doc The helm deployer sequences via filesystem lexicographic order, not Argo CD sync-waves. Code-quality review of commit 6d1a374b flagged the cross-deployer 'sync-wave' framing as potentially misleading in this file. One-line trim of the doc comment on TestBundleGolden_MixedWithPre.
Extend TestBundleGolden_MixedGPUOperator to also set ComponentPreManifests (a privileged-gpu-operator Namespace YAML), so the scenario now exercises the full three-phase emission: 001-gpu-operator-pre/ (Namespace manifest, NEW) 002-gpu-operator/ (upstream chart, index shifted from 001) 003-gpu-operator-post/ (dcgm-exporter, index shifted from 002) Locks in the lexicographic ordering that deploy.sh's [0-9][0-9][0-9]-*/ glob iterates at install time. The Argo deployer will see the same shift via folder-index → sync-wave; that mirror scenario lands in the next commit. Existing 5 helm goldens stay byte-identical — only this scenario's expected output changes.
TestBundleGolden_MixedWithPre freezes the Argo CD bundle output for a mixed component with both ComponentPreManifests and ComponentPostManifests set. The bundler emits three folders: 001-gpu-operator-pre/ (Namespace manifest, sync-wave 0) 002-gpu-operator/ (upstream chart, sync-wave 1) 003-gpu-operator-post/ (dcgm-exporter config, sync-wave 2) Sync-wave comes from folder index (argocd.go:472), so the pre folder naturally lands at wave 0 ahead of the chart at wave 1 — no argocd deployer change needed. Locks in the wave-from-index invariant against future writer refactors.
Five new Namespace manifests under <component>/talos/namespace.yaml,
each carrying pod-security.kubernetes.io/{enforce,audit,warn}=privileged
plus AICR-managed selectors. Referenced from the upcoming os-talos
mixin via componentRefs[].preManifestFiles so they land at
sync-wave N-1 ahead of each chart at wave N.
Components covered: gpu-operator, network-operator, nvsentinel,
nvidia-dra-driver-gpu, nodewright-operator. Out of scope: cloud-
specific drivers (aws-*, gke-*) and nodewright-customizations —
deferred per docs/superpowers/specs/2026-05-11-talos-namespace-mixin-design.md.
Redirects gpu-operator, network-operator, nvsentinel, nvidia-dra-driver-gpu, and nodewright-operator to their privileged-<name> namespaces and bundles each component's talos/namespace.yaml via preManifestFiles. Opted into via a leaf overlay's spec.mixins: [os-talos]; no leaf overlay change in this commit (recipe-author territory).
Standalone page documenting what the mixin does, the five privileged-<name> namespaces it generates, the PSA label set, apply ordering, and what's out of scope. Cross-linked from recipe-development.md so recipe authors find it when picking os: talos. Sidebar entry added in site/.vitepress/config.ts so check-docs-sidebar passes.
…loy.sh template The undeploy.sh template on main grew new pre-flight extra-CRD cleanup logic after this branch's cherry-picks were authored, so the mixed_with_pre golden's bundled undeploy.sh was 16 lines short. Pure regenerated-golden delta; no test or production code changed.
…name Renaming ComponentManifests → ComponentPostManifests in the Test_Mixed struct literal threw the column alignment off; gofmt straightens it. Pure formatting; no behavior change.
This comment was marked as resolved.
This comment was marked as resolved.
Coverage Report ✅
Coverage BadgeMerging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
mchmarny
left a comment
There was a problem hiding this comment.
Nicely structured refactor — the phase-parameterized collector and the shared injectAuxiliaryFolder helper keep the writer footprint small, and the goldens lock the lexicographic install order. Mixin authoring is clean, and the docs page is well-scoped.
One blocking issue: the helm-deployer pre-folder pattern (helm install … --namespace privileged-<x> --create-namespace against a chart whose only template is Namespace/privileged-<x>) is a known Helm 3 install conflict — Helm rejects the rendered Namespace as "already exists, missing managed-by=Helm". Every Talos component will hit this on first install via the helm deployer. Argo CD path may tolerate it; Helm path will not. See the comment on 001-foo-pre/install.sh for fix options.
Secondary asks: an e2e test that resolves the os-talos mixin end-to-end, a vendor + pre combination test, and stronger coupling between ComponentRef.namespace and the pre-manifest's Namespace.metadata.name.
CI is green and the goldens cover what they claim — the gap is install-time, not file-shape. Marking as request-changes because the production scenario is the one the goldens don't exercise.
… only
ADR-005 originally rejected any mixin componentRef whose name already
exists in the inheritance chain. That blocked the os-talos mixin from
contributing namespace + preManifestFiles overrides to components like
gpu-operator that are already declared upstream — exactly what the
mixin needs to do for OS-conditional PSA bookkeeping on Talos.
Relax the rule to a field-scoped carve-out: a mixin componentRef whose
name collides with the chain is allowed if and only if the mixin sets
nothing beyond the additive set {Namespace, ManifestFiles,
PreManifestFiles}. Identity / sourcing fields (Chart, Type, Source,
Version, Tag, Path, ValuesFile, Overrides, Patches, DependencyRefs,
Cleanup, ExpectedResources, HealthCheckAsserts) still produce a hard
error — those are the fields ADR-005's 'Silent constraint override'
risk row was protecting, and the mitigation stays intact for them.
Implementation:
- New mixinComponentRefSafeForMerge() helper inspects an incoming
mixin entry and returns the first offending identity field, so the
resolver's error message names the violation precisely.
- mergeMixins() drops the blanket component-name collision check and
calls the helper on name collisions; constraint name collision check
is unchanged.
Tests:
- TestMixinComponentRefSafeForMerge: 17 cases covering every additive-
safe field combination plus every identity field individually.
- TestMixinOSTalos_AppliesPrivilegedNamespacesAndPreManifests: e2e
check that loading the os-talos mixin from the embedded data and
applying it to a synthetic spec with concrete chart identities
produces the five privileged-<name> namespaces and the matching
talos/namespace.yaml entries on each component's PreManifestFiles.
Docs:
- ADR-005 Phase 3 step 7, the exit criteria, and the Risk Table all
updated to spell out the carve-out and preserve the rationale.
…>/manifests/
Move the five per-component talos namespace manifests from
recipes/components/<n>/talos/namespace.yaml to
recipes/components/<n>/manifests/talos-namespace.yaml. Aligns with the
established 'manifests/' convention for component-attached YAML and
reuses the existing components/*/manifests/*.yaml embed glob, dropping
the components/*/talos/*.yaml entry I added earlier in the branch.
The flat 'talos-namespace.yaml' naming (rather than a 'manifests/talos/'
sub-directory) avoids growing the embed surface until there's actually
more than one Talos-conditional manifest per component to justify it.
Updated:
- 5 file moves (git tracks as renames; content unchanged)
- recipes/data.go embed glob (drops components/*/talos/*.yaml)
- recipes/mixins/os-talos.yaml (5 preManifestFiles paths)
- docs/integrator/talos-integration.md (manifest-source column + prose)
- pkg/recipe/metadata_store_test.go (TestMixinOSTalos_* path expectations)
- pkg/bundler/deployer/helm/helm_test.go (golden scenarios)
- pkg/bundler/deployer/argocd/argocd_test.go (golden scenario assertion path)
- All three golden trees (mixed_with_pre helm + argocd, extended
mixed_gpu_operator) regenerated.
Verified end-to-end with the local mydata leaf overlay:
aicr recipe --data ./mydata --service eks --accelerator h100 --intent inference --os talos -o /tmp/talos.yaml # 16 components
aicr bundle -r /tmp/talos.yaml -o bundle # 111 files
ls bundle/ | grep gpu-operator
016-gpu-operator-pre # Namespace + PSS labels (sync-wave N-1)
017-gpu-operator # upstream chart
018-gpu-operator-post # dcgm-exporter / etc.
…uard, undeploy -pre mapping Addresses PR #856 review feedback. Pre-folder install.sh no longer passes --create-namespace to helm. The pre-injection chart's Namespace template owns namespace creation; when --create-namespace ran first the API created the namespace without Helm-ownership annotations and helm 3 refused to import it ("missing app.kubernetes.io/managed-by=Helm"). The install template now conditionally emits --create-namespace via a new CreateNamespace flag; pre folders pass false, primary and post pass true. Vendored primary folders thread the flag too. Add bundle-time drift guard: any Namespace doc in a pre-manifest must target ComponentRef.Namespace. Without this, a typo in the mixin manifest produces a bundle that installs cleanly past type checks but blows up at helm install with an opaque "namespace not found". Wire the helm-deployer undeploy.sh template to map "<name>-pre" folders to their parent namespace. Previously only primary and "-post" had mappings, so foo-pre's ns would resolve to "" and the release was silently skipped, leaving the privileged-namespace release orphaned. Generalize the ServerSideApply rationale comment in the Argo CD Application template so it covers Applications that don't need the 262144-byte annotation-cap workaround (pre-folder Namespaces, post-folder ConfigMaps) — keeping it uniform across every Application avoids per-Application template branching. Other review items: - collectComponentManifestsByPhase switches to errors.PropagateOrWrap so a future provider returning a more specific code (e.g. ErrCodeNotFound from a layered provider) isn't masked. - PreManifestFiles godoc calls out the ".." traversal guarantees from the data provider so a recipe author understands the path contract. - Stale "the writer threads the map through but does not yet emit pre-phase folders" comment removed from localformat/doc.go. - talos-integration.md drops the leading HTML license comments (Fern CI rejects them) and labels its layout fence as `text`. - recipe-development.md clarifies that os-talos appends to spec.mixins rather than replaces, and explains that the mixin supplies Namespaces via componentRefs[].preManifestFiles. Tests: - TestWrite_PreFolderInstallOmitsCreateNamespace pins the install.sh flag contract for pre / primary / post folders. - TestWrite_VendorCharts_PreManifests covers VendorCharts=true with ComponentPreManifests so the pre folder is emitted ahead of the vendored primary and not recorded in the vendor audit. - TestWrite_PreManifestNamespaceDrift exercises the new drift guard. - mixed_gpu_operator fixture namespace updated to privileged-gpu-operator to match its Namespace metadata.name; all helm/argocd/argocdhelm goldens regenerated. make qualify passes locally.
…S-mixin example Addresses additional CodeRabbit comments on PR #856. writer.go: replace the 3*len(Components) > 999 fail-fast with a precise count of emitted folders using the same pre/primary/post rules the main loop applies. The old bound rejected valid bundles once a recipe had more than 333 components even when most components produced only their primary folder — e.g., 500 plain upstream-Helm components emit 500 NNN-* directories but the old check tripped before any work happened. Add TestWrite_FolderLimit_CountsEmissionsNotComponents pinning that 400 primary-only components stay under the cap and 1000 exceed it. recipe-development.md: drop the misleading "[os-ubuntu, os-talos]" example. OS-scoped mixins are mutually exclusive — combining two is a recipe authoring error, not a supported composition — so the example now uses "[platform-kubeflow, os-talos]" alongside an explicit note that os-ubuntu + os-talos in one overlay is not valid.
Addresses CodeRabbit comments on PR #856. writer_test.go: TestWrite_PreManifestNamespaceDrift's docstring was sitting above TestWrite_FolderLimit_CountsEmissionsNotComponents after that test got inserted between the comment and its function. Move the comment back to its function. writer.go: validatePreManifestNamespace breaks the inner YAML decode loop on parse error, which skips any subsequent docs in the same multi-doc file. Document this in the function comment alongside the escape hatch (swap break for continue) so a future change to allow multiple Namespace docs per pre-manifest file has the breadcrumb.
…pace KWOK e2e for the GB200 recipes (#868) surfaced a latent bug in PR #856: pre-injection install.sh unconditionally omits --create-namespace, which deadlocks any recipe that ships a bare-resource manifest (ConfigMap, ResourceQuota, etc.) in preManifestFiles. The target namespace doesn't exist yet — the primary chart's helm install --create-namespace hasn't run — so the pre-pass install fails with 'namespaces "<n>" not found'. The original 'pre = no --create-namespace' rule was correct for the Talos case (chart ships a Namespace template; --create-namespace would collide with Helm 3 ownership annotations) but wrong as a blanket rule. Move the decision into writeLocalHelmFolder: scan rendered manifests for a top-level 'kind: Namespace'; if one is present, downgrade createNamespace to false; otherwise honor the caller's request. Callers (injectAuxiliaryFolder, the primary local-helm path) now uniformly ask for createNamespace=true and let detection decide. Both call shapes are exercised: - Talos: ships a Namespace → detection downgrades → install.sh omits --create-namespace (unchanged behavior; existing goldens pass) - GB200 kernel-module-params: ships only a ConfigMap → detection leaves it true → install.sh creates 'gpu-operator' namespace before the ConfigMap, ahead of the primary chart's own --create-namespace which becomes a no-op against the existing namespace Refs #859, follow-up to #856
Summary
Adds an opt-in
recipes/mixins/os-talos.yamlmixin that overrides install namespaces toprivileged-<component>and bundles PSA-privileged Namespace manifests for five components, plus the bundler change that makes those Namespace manifests apply before the chart instead of after.Motivation / Context
Talos Linux enforces Pod Security Admission (PSA)
restrictedby default. AICR components that need privileged access (host filesystem, host network, root user) get rejected. The Talos collector PR (#762) addressed the snapshot agent's pod-shape; this PR addresses the downstream components.Fixes part of #565.
Related: #762 (Talos local-test harness + snapshot agent OS=talos branch).
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)recipes/components/<name>/talos/namespace.yaml× 5,recipes/mixins/os-talos.yaml,site/.vitepress/config.tsImplementation Notes
recipes/mixins/os-talos.yaml(new)A new mixin (
kind: RecipeMixin) that recipe authors opt into viaspec.mixins: [os-talos]. For each of five components —gpu-operator,network-operator,nvsentinel,nvidia-dra-driver-gpu,nodewright-operator— it overrides the install namespace toprivileged-<name>and attaches a per-componenttalos/namespace.yamlvia the newpreManifestFilesfield. ConstraintOS.release.ID == talosguards against accidental non-Talos installs.Excluded by design (deferred follow-up tracked under #565): cloud-specific drivers (
aws-ebs-csi-driver,aws-efa,gke-nccl-tcpxo),nodewright-customizations(per-node privileged story not settled),kube-prometheus-stack(onlynode-exporterneeds privileged; whole-chart move is too coarse).Per-component Namespace manifests
Five new files at
recipes/components/<name>/talos/namespace.yaml. Each carriespod-security.kubernetes.io/{enforce,audit,warn}=privilegedplus AICR-managed selectors (app.kubernetes.io/managed-by: aicr,app.kubernetes.io/component: <name>,aicr.nvidia.com/os: talos) so fleet-wide audits can identify them.Bundler
preManifestFilessupportMechanical extension of the existing
ManifestFilespost-injection path:pkg/recipe/metadata.go): newComponentRef.PreManifestFiles []stringfield; additive merge mirrors the existingManifestFilesmerge.pkg/bundler/bundler.go): a phase-parameterizedcollectComponentManifestsByPhaseshared bycollectComponentManifests(post, preserved as the original entry point) and the newcollectComponentPreManifests. One body, two thin wrappers.pkg/bundler/bundler.go+pkg/bundler/deployer/{argocd,argocdhelm,helm}/): each Generator now carriesComponentPreManifestsandComponentPostManifests. Pre/post error paths are differentiated and route througherrors.PropagateOrWrap.pkg/bundler/deployer/localformat/writer.go): newOptions.injectAuxiliaryFolder(idx, c, phase)helper used by both pre and post emission — singlewriteLocalHelmFoldercall site for auxiliary folders. Pre-injection fires beforeclassify(...)so it works regardless of primary kind (upstream-helm, local-helm, kustomize, or vendored). Folder upper-bound raised from2*Nto3*N. Collision check covers both<name>-preand<name>-post. The pre path coexists withVendorCharts; the post path is skipped under vendoring (matches the pre-existing "vendored mode collapses mixed into one folder" contract).Documentation
docs/integrator/talos-integration.mddocuments what the mixin does, the five privileged namespaces, the PSA label set, apply ordering, and what's out of scope.docs/integrator/recipe-development.mdso recipe authors find the new page when they pickos: talos.site/.vitepress/config.ts(make check-docs-sidebarenforces).Testing
Golden coverage for the new code path:
pkg/bundler/deployer/helm/testdata/mixed_with_pre/—preManifestFiles-only scenario. Bundle =001-foo-pre/(local-helm wrapping Namespace) +002-foo/(upstream chart). Locks in lexicographic ordering for thedeploy.sh[0-9][0-9][0-9]-*/glob.pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/— extended scenario covering pre + primary + post in one bundle (001-gpu-operator-pre/,002-gpu-operator/,003-gpu-operator-post/).pkg/bundler/deployer/argocd/testdata/mixed_with_pre/— Argo mirror. Sync-waves0,1,2for pre/primary/post; verifies the wave-from-folder-index allocation inargocd.go.Existing 5 helm and 5 argocd-style goldens remain byte-identical: pre-injection is a no-op for any recipe that doesn't set
PreManifestFiles.Risk Assessment
Rollout notes: Pure additive. No existing recipe sets
preManifestFiles, so every shipping bundle's output is byte-identical (verified by the unchanged 10 pre-existing goldens). The mixin is opt-in via leaf-overlayspec.mixins: [os-talos]; no leaf overlay is changed in this PR (recipe-author territory). Revert is a singlegit revertof the merge commit.Checklist
make testwith-race)make lint)git commit -S)