fix(validator): per-run RBAC names to prevent concurrent-run races #888
Walkthrough: This PR makes RBAC resource naming run-scoped by suffixing ServiceAccount and ClusterRoleBinding names with a per-run identifier. It converts RBAC name constants into helper functions. Estimated code review effort: 4 (Complex), ~45 minutes.
The validator orchestrator's ServiceAccount and ClusterRoleBinding had fixed names (aicr-validator). Two concurrent `aicr validate` runs against the same namespace (or sharing the cluster, for the CRB) would clobber each other: run A's end-of-run CleanupRBAC deleted the SA while run B was still queuing validator Jobs, causing FailedCreate loops until the per-check backoffLimit timeout.

Suffix the SA and CRB names with the per-run runID, the same pattern already used for the snapshot/validation ConfigMaps. Each run holds disjoint resources, and cleanup only deletes what that run created.

The `app.kubernetes.io/name=aicr-validator` label is unchanged, so discovery tooling that matches by label keeps working. `tests/e2e/run.sh` is updated to look up the SA/CRB by that label instead of literal name.

New regression test `TestEnsureRBACConcurrentRunsIndependent` creates two runs against the same namespace, cleans up one, and asserts the other's resources survive.
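The naming change described in this commit message can be sketched roughly as follows. This is a minimal illustration, assuming the helpers simply join a shared prefix and the run ID; the constant name `rbacBaseName` and the run IDs `run-a` / `run-b` are hypothetical, and the real function bodies in `pkg/validator` may differ.

```go
package main

import "fmt"

// rbacBaseName is a hypothetical name for the shared prefix. Its value
// matches the fixed name the resources used before this change.
const rbacBaseName = "aicr-validator"

// ServiceAccountName returns the per-run ServiceAccount name,
// i.e. "aicr-validator-<runID>". Body is an assumption.
func ServiceAccountName(runID string) string {
	return rbacBaseName + "-" + runID
}

// ClusterRoleBindingName returns the per-run ClusterRoleBinding name,
// i.e. "aicr-validator-<runID>". Body is an assumption.
func ClusterRoleBindingName(runID string) string {
	return rbacBaseName + "-" + runID
}

func main() {
	// Two concurrent runs now resolve to disjoint resource names.
	fmt.Println(ServiceAccountName("run-a"))
	fmt.Println(ClusterRoleBindingName("run-b"))
}
```

Because the name is derived from the run ID, cleanup for one run cannot address another run's resources by name.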
Sweep of doc-quality issues uncovered by an audit of docs/, demos/, and examples/ against current main. Fixes break into four buckets:

ERROR (breaking or contradicts code):
- examples/recipes/eks-training.yaml: deploymentOrder listed kube-prometheus-stack with no matching componentRefs entry, so bundle generation rejected the recipe. Added the missing entry.
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same defect; performance phase referenced nonexistent checks (nccl-bandwidth-test, fabric-health-check) and an unknown `infrastructure: nccl-doctor` line; replaced with the real nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults.
- docs/contributor/validator.md: validators.Context struct field `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput *v1.ValidationInput`; gated skip example updated to match. Added an RBAC subsection documenting per-run aicr-validator-<runID> naming introduced in PR NVIDIA#888.
- docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to match RecipeCacheTTL (10 min); root endpoint `routes` list expanded from [/v1/recipe] to all three registered routes; Query parameter table gained the missing `platform` row; broken anchor index.md#cicd-architecture replaced with a working link.
- docs/contributor/cli.md: Factory interface snippet went from 4 to 5 methods (CreateNodeTopologyCollector was missing); snapshot measurements list gained `topology`; bogus [INTERNAL]-wrapped invalid-accelerator example corrected to [INVALID_REQUEST]; duplicate log line and empty section removed.
- docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT) to the validate exit-code table; `--best-effort` (a real flag) is no longer used as the "rejected typo" example.
- docs/user/api-reference.md: Bundle Components table gained the three missing components (nfd, slinky-slurm-operator, slinky-slurm-operator-crds) and was re-sorted alphabetically.
- docs/conformance/cncf/index.md: corrected script path pkg/evidence/scripts/collect-evidence.sh → pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the directory tree; section-name list updated to match the script's case block.

INCONSISTENCY / drift:
- platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm) in docs/README.md, docs/contributor/validations.md.
- OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux, talos) in docs/integrator/data-flow.md.
- Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in docs/integrator/{recipe-development,automation,data-flow}.md.
- demos/cuj1-eks.md: --system-node-selector dedicated=system-workload (taint key as a label selector) → nodeGroup=system-worker; matching prose adjustment.
- demos/cuj2.md: toleration grammar field [operation] → [effect].
- demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box borders re-aligned.
- demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per project artifact-location convention; stale "18 + 1 = 19" component count replaced with placeholders.
- demos/data.md: "Asymmetric rule matching" → "Dependency-driven ordering based on Kahn's algorithm".
- demos/README.md: index table gained the missing CUJ pages.
- docs/integrator/index.md and docs/README.md: integrator-pages tables gained missing rows.

STYLE / slug hygiene (CLAUDE.md doc-style rules):
- docs/user/cli-reference.md: stripped trailing (--flag) parentheticals from three headings (Storage Class, Deployment Methods, Value Overrides). The Deploy/Undeploy variants are left alone in this PR because their slugs have many inbound references in bundle templates and golden-files, which would need a coordinated follow-up.
- docs/contributor/api-server.md: three headings rewritten with "and" in place of "&" to fix double-hyphen slugs.
- docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)" from a heading.

TYPO / NIT:
- docs/README.md "a automated" → "an automated".
- docs/integrator/recipe-development.md "end to end" → "end-to-end".
- demos/cuj2.md "comma delimination" → "comma-separated".
- demos/valid.md sentence fragment completed.
- docs/user/component-catalog.md `***Note:***` → `**Note:**`.
- docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr → github.com/NVIDIA/aicr.
- docs/user/agent-deployment.md and docs/user/cli-reference.md literal `app.kubernetes.io/version: v0.17.0` → <aicr-version> placeholder.
- docs/contributor/cli.md CronJob example pin v0.6.4 → <release-tag> placeholder.
- examples/recipes/README.md driver version 580.82.07 → 580.105.08 to match current gpu-operator pin.
- examples/recipes/{eks-training,eks-gb200-...,kind}.yaml metadata.version v0.26.7-next → dev (hand-written-example convention shared with aks-training.yaml).
- demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log as a pre-NVIDIA#871 historical capture (kgateway → agentgateway migration).

Also adds the missing CNCF feature list to docs/user/cli-reference.md --feature row and a "Scoping conformance to specific features" section in docs/user/validation.md, since the 9 ValidFeatures names from pkg/evidence/cncf/collector.go were not documented for users.

Skipped (out of scope or risk-deferred):
- ADR-002 cross-references to per-run RBAC names beyond the existing 2026-05-14 implementation note: ADRs are frozen historical records.
- Deploy Script Behavior / Undeploy Script Behavior heading parentheticals: ~23 inbound anchor links in bundle templates and golden-files would need a coordinated PR.
- demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup: both files already explicitly note "No podTemplateOverrides / runtimePatches needed" since the torch-distributed runtime bakes scheduling at bundle time.
- pre-existing site/docs/* renders (gitignored, auto-generated).

Doc-only PR; full `make qualify` skipped per CLAUDE.local.md ("doc-only / infra-only change ... cheap checks are enough"). Ran `make lint` (yamllint, gofmt, license-check, sidebar check, agents sync, chart-pin verification); all green.
Summary
Suffix the validator `ServiceAccount` and `ClusterRoleBinding` names with the per-run `runID` so two concurrent `aicr validate` invocations against the same cluster do not delete each other's RBAC during end-of-run cleanup.

Motivation / Context
When two `aicr validate` runs overlap on the same namespace, the second run's later validators trip `FailedCreate: serviceaccount "aicr-validator" not found` for ~10 minutes until each per-check `AICR_CHECK_TIMEOUT` fires. Root cause: both runs share the fixed `aicr-validator` SA + CRB. When run A finishes and runs `CleanupRBAC`, run B is mid-flight and loses its SA, so every subsequent inner-Job pod-create fails.

The data ConfigMaps already encode the runID (`aicr-snapshot-<runID>`, `aicr-validation-<runID>`) to avoid exactly this class of race. RBAC was the singleton hole.

Fixes: N/A
Related: N/A
Type of Change
Component(s) Affected
- `pkg/validator`
- `docs/`, `examples/`

Implementation Notes
- `ServiceAccountName` and `ClusterRoleBindingName` change from exported package-level constants (`"aicr-validator"`) to functions of `runID` returning `"aicr-validator-<runID>"`.
- `EnsureRBAC` and `CleanupRBAC` take a `runID` parameter; `Validator.RunID` is plumbed through `prepareCluster` / `deferClusterCleanup`, and `Deployer.buildPodSpecApply` uses `ServiceAccountName(d.runID)` so per-validator Jobs mount the right SA.
- The `ServiceAccountName` / `ClusterRoleBindingName` constants become functions. No external consumers in the codebase.
- The `app.kubernetes.io/name=aicr-validator` label is unchanged, so label-based discovery still works.
- The ADR (`docs/design/002-validatorv2-adr.md`) gains an inline `Implementation note (2026-05-14)` block: the original design described fixed names, and readers should know why the names now carry a suffix.
- `tests/e2e/run.sh` (used by `make e2e-tilt`, not `make qualify`) is updated to discover the SA/CRB by label rather than by literal name.

Testing
`make qualify`: green (test + lint + e2e + scan).

New regression test: `TestEnsureRBACConcurrentRunsIndependent` runs `EnsureRBAC` twice with two different runIDs against the same namespace, then runs `CleanupRBAC` for run A only, and asserts run B's SA and CRB are still present.

Existing RBAC tests updated to thread a `runID` through `EnsureRBAC` / `CleanupRBAC`.

Coverage (affected packages): no decrease; new function paths are exercised by the same existing tests plus the new regression test.
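The scenario the regression test covers can be sketched with an in-memory stand-in for the cluster. Everything below is illustrative: `fakeCluster`, `rbacName`, `HasRBAC`, the namespace `aicr-validation`, and the run IDs are hypothetical, and the method signatures are simplified assumptions rather than the real `EnsureRBAC` / `CleanupRBAC` signatures.

```go
package main

import "fmt"

// fakeCluster is a hypothetical in-memory stand-in for the API server.
// It only tracks which ServiceAccount and ClusterRoleBinding names exist.
type fakeCluster struct {
	serviceAccounts     map[string]bool // key: namespace/name
	clusterRoleBindings map[string]bool // key: name (cluster-scoped)
}

func newFakeCluster() *fakeCluster {
	return &fakeCluster{
		serviceAccounts:     map[string]bool{},
		clusterRoleBindings: map[string]bool{},
	}
}

// rbacName mirrors the per-run naming scheme: "aicr-validator-<runID>".
func rbacName(runID string) string {
	return "aicr-validator-" + runID
}

// EnsureRBAC creates the run's SA and CRB (simplified signature).
func (c *fakeCluster) EnsureRBAC(namespace, runID string) {
	c.serviceAccounts[namespace+"/"+rbacName(runID)] = true
	c.clusterRoleBindings[rbacName(runID)] = true
}

// CleanupRBAC deletes only the resources named after this run.
func (c *fakeCluster) CleanupRBAC(namespace, runID string) {
	delete(c.serviceAccounts, namespace+"/"+rbacName(runID))
	delete(c.clusterRoleBindings, rbacName(runID))
}

// HasRBAC reports whether both of the run's resources still exist.
func (c *fakeCluster) HasRBAC(namespace, runID string) bool {
	return c.serviceAccounts[namespace+"/"+rbacName(runID)] &&
		c.clusterRoleBindings[rbacName(runID)]
}

func main() {
	c := newFakeCluster()
	c.EnsureRBAC("aicr-validation", "run-a")
	c.EnsureRBAC("aicr-validation", "run-b")

	// Run A finishes first and cleans up while run B is still in flight.
	c.CleanupRBAC("aicr-validation", "run-a")

	fmt.Println("run A present:", c.HasRBAC("aicr-validation", "run-a")) // false
	fmt.Println("run B present:", c.HasRBAC("aicr-validation", "run-b")) // true
}
```

With a single fixed name, both runs would write and delete the same map keys, which is the clobbering behavior the PR removes.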
Risk Assessment
Rollout notes: No CLI flags or APIs change. On-cluster resource names change (`aicr-validator` → `aicr-validator-<runID>`), so any external automation that looks up the SA/CRB by literal name (rather than by the `app.kubernetes.io/name=aicr-validator` label) needs an update. None known in-tree besides `tests/e2e/run.sh`, fixed here.

Checklist
- `make test` with `-race`
- `make qualify` passes
- `-S`