fix(examples): remove stale nvsentinel manifest reference in aks-training#716
fix(examples): remove stale nvsentinel manifest reference in aks-training#716yuanchen8911 wants to merge 1 commit into
Conversation
…ning The example references `components/nvsentinel/manifests/allow-intra- namespace.yaml`, which does not exist in the tree, so `aicr bundle examples/recipes/aks-training.yaml` fails: failed to load manifest components/nvsentinel/manifests/ allow-intra-namespace.yaml for component nvsentinel: open ...: file does not exist Bug history: - NVIDIA#68 (`fix: add NetworkPolicy workaround for nvsentinel metrics- access restriction`) introduced the manifest as a workaround for nvsentinel < v0.7.0, which shipped a `metrics-access` NetworkPolicy with `podSelector: {}` that blocked intra-namespace traffic (DCGM port 5555) and API-server -> webhook traffic (cert-manager port 443). - NVIDIA#309 (`feat: move to latest NVSentinel`) bumped nvsentinel past v0.7.0 and removed the workaround source file. The reference in the existing examples was cleaned up at that time. - NVIDIA#415 (`feat: Add AKS H100 recipe overlays`) added the new AKS example by copying the template structure but kept the now- orphaned `manifestFiles:` block. The reference has been broken since NVIDIA#415 merged. The nvsentinel pin in this example is `v0.10.0`, well past the v0.7.0 cutoff, so the workaround is irrelevant. Removing the two orphaned lines lets the example bundle cleanly. Verified locally: aicr bundle -r examples/recipes/aks-training.yaml -o /tmp/bundle ... succeeds, no missing-manifest error
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (1)
💤 Files with no reviewable changes (1)
📝 WalkthroughWalkthroughThe pull request modifies the Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Review rate limit: 9/10 reviews remaining, refill in 6 minutes. Comment |
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required. aws-ebs-csi-driver 2.55.0 -> 2.59.0 cert-manager v1.17.2 -> v1.20.2 kube-prometheus-stack 82.8.0 -> 84.4.0 kueue 0.17.0 -> 0.17.1 nodewright-operator v0.14.0 -> v0.15.1 nvsentinel v1.1.0 -> v1.3.0 Excluded from this PR: - kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework — deferred. - aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup including a real security-posture change (chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review. - kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred from NVIDIA/ to kai-scheduler/ org and chart publishing moved with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump — coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here. - kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator + chart bumps belong together in a follow-up PR to keep this PR pure config / no Go changes. Companion changes: - examples/recipes/{kind,eks-training,aks-training,eks-gb200- ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450). - examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415 (the workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference). Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it. Verified locally. - vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md: resync with the v0.26.2 content already declared in vendor/modules.txt. The prior dep-update commit on main (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't refresh those two doc files, so `go mod vendor` in CI produces a diff against the committed vendor and the `tests/Test` gate fails. Running `go mod vendor` here picks up the consistent v0.26.2 docs. Refs: NVIDIA#698 Closes: NVIDIA#716
|
Absorbing this fix into #715 — the 2-line deletion now lands as part of the phase-1 recipe bumps PR (commit |
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required. aws-ebs-csi-driver 2.55.0 -> 2.59.0 cert-manager v1.17.2 -> v1.20.2 kube-prometheus-stack 82.8.0 -> 84.4.0 kueue 0.17.0 -> 0.17.1 nodewright-operator v0.14.0 -> v0.15.1 nvsentinel v1.1.0 -> v1.3.0 Excluded from this PR: - kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework — deferred. - aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup including a real security-posture change (chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review. - kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred from NVIDIA/ to kai-scheduler/ org and chart publishing moved with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump — coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here. - kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator + chart bumps belong together in a follow-up PR to keep this PR pure config / no Go changes. Companion changes: - examples/recipes/{kind,eks-training,aks-training,eks-gb200- ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450). - examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415 (the workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference). Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it. Verified locally. - vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md: resync with the v0.26.2 content already declared in vendor/modules.txt. The prior dep-update commit on main (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't refresh those two doc files, so `go mod vendor` in CI produces a diff against the committed vendor and the `tests/Test` gate fails. Running `go mod vendor` here picks up the consistent v0.26.2 docs. Refs: NVIDIA#698 Closes: NVIDIA#716
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required. aws-ebs-csi-driver 2.55.0 -> 2.59.0 cert-manager v1.17.2 -> v1.20.2 kube-prometheus-stack 82.8.0 -> 84.4.0 kueue 0.17.0 -> 0.17.1 nodewright-operator v0.14.0 -> v0.15.1 nvsentinel v1.1.0 -> v1.3.0 Excluded from this PR: - kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework — deferred. - aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup including a real security-posture change (chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review. - kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred from NVIDIA/ to kai-scheduler/ org and chart publishing moved with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump — coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here. - kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator + chart bumps belong together in a follow-up PR to keep this PR pure config / no Go changes. Companion changes: - examples/recipes/{kind,eks-training,aks-training,eks-gb200- ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450). - examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415 (the workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference). Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it. Verified locally. - vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md: resync with the v0.26.2 content already declared in vendor/modules.txt. The prior dep-update commit on main (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't refresh those two doc files, so `go mod vendor` in CI produces a diff against the committed vendor and the `tests/Test` gate fails. Running `go mod vendor` here picks up the consistent v0.26.2 docs. Refs: NVIDIA#698 Closes: NVIDIA#716
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required. aws-ebs-csi-driver 2.55.0 -> 2.59.0 cert-manager v1.17.2 -> v1.20.2 kube-prometheus-stack 82.8.0 -> 84.4.0 kueue 0.17.0 -> 0.17.1 nodewright-operator v0.14.0 -> v0.15.1 nvsentinel v1.1.0 -> v1.3.0 Excluded from this PR: - kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework — deferred. - aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup including a real security-posture change (chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review. - kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred from NVIDIA/ to kai-scheduler/ org and chart publishing moved with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump — coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here. - kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator + chart bumps belong together in a follow-up PR to keep this PR pure config / no Go changes. Companion changes: - examples/recipes/{kind,eks-training,aks-training,eks-gb200- ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450). - examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415 (the workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference). Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it. Verified locally. - vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md: resync with the v0.26.2 content already declared in vendor/modules.txt. The prior dep-update commit on main (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't refresh those two doc files, so `go mod vendor` in CI produces a diff against the committed vendor and the `tests/Test` gate fails. Running `go mod vendor` here picks up the consistent v0.26.2 docs. Refs: NVIDIA#698 Closes: NVIDIA#716
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required. aws-ebs-csi-driver 2.55.0 -> 2.59.0 cert-manager v1.17.2 -> v1.20.2 kube-prometheus-stack 82.8.0 -> 84.4.0 kueue 0.17.0 -> 0.17.1 nodewright-operator v0.14.0 -> v0.15.1 nvsentinel v1.1.0 -> v1.3.0 Excluded from this PR: - kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework — deferred. - aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup including a real security-posture change (chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review. - kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred from NVIDIA/ to kai-scheduler/ org and chart publishing moved with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump — coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here. - kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator + chart bumps belong together in a follow-up PR to keep this PR pure config / no Go changes. Companion changes: - examples/recipes/{kind,eks-training,aks-training,eks-gb200- ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450). - examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415 (the workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference). Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it. Verified locally. - vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md: resync with the v0.26.2 content already declared in vendor/modules.txt. The prior dep-update commit on main (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't refresh those two doc files, so `go mod vendor` in CI produces a diff against the committed vendor and the `tests/Test` gate fails. Running `go mod vendor` here picks up the consistent v0.26.2 docs. Refs: NVIDIA#698 Closes: NVIDIA#716
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required. aws-ebs-csi-driver 2.55.0 -> 2.59.0 cert-manager v1.17.2 -> v1.20.2 kube-prometheus-stack 82.8.0 -> 84.4.0 kueue 0.17.0 -> 0.17.1 nodewright-operator v0.14.0 -> v0.15.1 nvsentinel v1.1.0 -> v1.3.0 Excluded from this PR: - kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework — deferred. - aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup including a real security-posture change (chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review. - kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred from NVIDIA/ to kai-scheduler/ org and chart publishing moved with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump — coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here. - kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator + chart bumps belong together in a follow-up PR to keep this PR pure config / no Go changes. Companion changes: - examples/recipes/{kind,eks-training,aks-training,eks-gb200- ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450). - examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415 (the workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference). Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it. - recipes/registry.yaml: also fix the gke-nccl-tcpxo registry entry to use the established manifest-only Helm pattern (empty `helm.defaultRepository` plus `defaultNamespace: kube-system`) instead of the unparsed `manifest:` block. The `manifest:` field is not on the ComponentConfig struct, so its `defaultNamespace` was silently ignored. Pre-NVIDIA#706 this was inert (manifest-only components were installed via raw `kubectl apply`, which routed via inline `metadata.namespace`). After NVIDIA#706 wraps every component as a local Helm chart, the generated install.sh emits `--namespace --create-namespace` (empty) and Helm fails. This blocks every post-NVIDIA#706 GKE-COS H100 KWOK training run, including this PR's CI which auto-promotes the GKE-COS Tier-2 matrix when registry.yaml or base.yaml change. Switches to the same pattern used by `nodewright-customizations`. Verified bundled install.sh now contains `--namespace kube-system`. Supersedes NVIDIA#718. Refs: NVIDIA#698 Closes: NVIDIA#716, NVIDIA#718
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required. aws-ebs-csi-driver 2.55.0 -> 2.59.0 cert-manager v1.17.2 -> v1.20.2 kube-prometheus-stack 82.8.0 -> 84.4.0 kueue 0.17.0 -> 0.17.1 nodewright-operator v0.14.0 -> v0.15.1 nvsentinel v1.1.0 -> v1.3.0 Excluded from this PR: - kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework — deferred. - aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup including a real security-posture change (chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review. - kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred from NVIDIA/ to kai-scheduler/ org and chart publishing moved with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump — coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here. - kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator + chart bumps belong together in a follow-up PR to keep this PR pure config / no Go changes. Companion changes: - examples/recipes/{kind,eks-training,aks-training,eks-gb200- ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450). - examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415 (the workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference). Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it. Refs: NVIDIA#698 Closes: NVIDIA#716
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required. aws-ebs-csi-driver 2.55.0 -> 2.59.0 cert-manager v1.17.2 -> v1.20.2 kube-prometheus-stack 82.8.0 -> 84.4.0 kueue 0.17.0 -> 0.17.1 nodewright-operator v0.14.0 -> v0.15.1 nvsentinel v1.1.0 -> v1.3.0 Excluded from this PR: - kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework — deferred. - aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup including a real security-posture change (chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review. - kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred from NVIDIA/ to kai-scheduler/ org and chart publishing moved with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump — coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here. - kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator + chart bumps belong together in a follow-up PR to keep this PR pure config / no Go changes. Companion changes: - examples/recipes/{kind,eks-training,aks-training,eks-gb200- ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450). - examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415 (the workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference). Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it. Refs: NVIDIA#698 Closes: NVIDIA#716
Summary
Remove an orphaned
manifestFiles:reference inexamples/recipes/aks-training.yamlthat points to a non-existent file (components/nvsentinel/manifests/allow-intra-namespace.yaml). The reference has been broken since the AKS example was added.Motivation / Context
aicr bundle examples/recipes/aks-training.yamlcurrently fails:Bug history
fix: add NetworkPolicy workaround for nvsentinel metrics-access restriction— introduced the manifest as a workaround for nvsentinel < v0.7.0, which shipped ametrics-accessNetworkPolicy withpodSelector: {}that isolated all pods in the namespace and blocked:5555)443)feat: move to latest NVSentinel— bumped nvsentinel past v0.7.0 (the upstream NVSentinel/PR#789 fix landed) and removed the workaround source file. References in existing examples were cleaned up at that time.feat: Add AKS (Azure Kubernetes Service) H100 recipe overlays— added the new AKS example by copying the template structure from another example, but kept the now-orphanedmanifestFiles:block. The reference has been broken since this PR merged.The nvsentinel pin in this example is
v0.10.0, well past thev0.7.0cutoff, so the workaround is irrelevant. Removing the two orphaned lines lets the example bundle cleanly.Surfaced during phase 1 of the recipe version-bump campaign (#698 / #715), where Codex's verification flagged this as a pre-existing bundling failure on
mainnot introduced by the bump PR. Filing as a standalone fix per scoping discussion.Fixes: bundling failure on
aicr bundle examples/recipes/aks-training.yamlRelated: #698, #715
Type of Change
Component(s) Affected
pkg/recipe) — example onlyImplementation Notes
Two-line deletion. No values or version changes. Other example recipes (
kind.yaml,eks-training.yaml,eks-gb200-ubuntu-training-with-validation.yaml) bundle cleanly and don't reference the missing manifest.Testing
Risk Assessment
Rollout notes: None. Example file only.
Checklist
make testwith-race)make lint)git commit -S)