
fix(validator): per-run RBAC names to prevent concurrent-run races #888

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:fix/validator-rbac-per-run-names
May 14, 2026

Conversation

@yuanchen8911
Contributor

Summary

Suffix the validator ServiceAccount and ClusterRoleBinding names with the per-run runID so two concurrent aicr validate invocations against the same cluster do not delete each other's RBAC during end-of-run cleanup.

Motivation / Context

When two aicr validate runs overlap on the same namespace, the second run's later validators trip FailedCreate: serviceaccount "aicr-validator" not found for ~10 minutes until each per-check AICR_CHECK_TIMEOUT fires. Root cause: both runs share the fixed aicr-validator SA + CRB. When run A finishes and runs CleanupRBAC, run B is mid-flight and loses its SA — every subsequent inner-Job pod-create fails.

The data ConfigMaps already encode the runID (aicr-snapshot-<runID>, aicr-validation-<runID>) to avoid exactly this class of race. RBAC was the singleton hole.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Validator (pkg/validator)
  • Docs/examples (docs/, examples/)

Implementation Notes

ServiceAccountName and ClusterRoleBindingName change from exported package-level constants ("aicr-validator") to functions of runID returning "aicr-validator-<runID>". EnsureRBAC and CleanupRBAC take a runID parameter; Validator.RunID is plumbed through prepareCluster / deferClusterCleanup, and Deployer.buildPodSpecApply uses ServiceAccountName(d.runID) so per-validator Jobs mount the right SA.

  • Public API: the two exported ServiceAccountName / ClusterRoleBindingName constants become functions. No external consumers in the codebase.
  • Labels: app.kubernetes.io/name=aicr-validator is unchanged, so label-based discovery still works.
  • ADR-002 (docs/design/002-validatorv2-adr.md) gains an inline Implementation note (2026-05-14) block — the original design described fixed names, and readers should know why the names now carry a suffix.
  • tests/e2e/run.sh (used by make e2e-tilt, not make qualify) is updated to discover the SA/CRB by label rather than by literal name.
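
As a rough illustration of the constant-to-function change described above, the naming helpers might look like the sketch below. The function names and the "aicr-validator-<runID>" format come from this description; the `validatorName` constant, the exact function bodies, and the sample run IDs are assumptions for illustration, not the code in pkg/validator/job/rbac.go.

```go
package main

import "fmt"

// validatorName is a hypothetical base-name constant; its value also matches
// the app.kubernetes.io/name label value, which this PR leaves unchanged.
const validatorName = "aicr-validator"

// ServiceAccountName returns the run-scoped ServiceAccount name,
// "aicr-validator-<runID>".
func ServiceAccountName(runID string) string {
	return validatorName + "-" + runID
}

// ClusterRoleBindingName returns the run-scoped ClusterRoleBinding name,
// "aicr-validator-<runID>".
func ClusterRoleBindingName(runID string) string {
	return validatorName + "-" + runID
}

func main() {
	// "run-a" and "run-b" are made-up run IDs used only for this sketch.
	for _, runID := range []string{"run-a", "run-b"} {
		fmt.Println(ServiceAccountName(runID), ClusterRoleBindingName(runID))
	}
}
```

Because the name is derived from the runID alone, any code path that knows its run (EnsureRBAC, CleanupRBAC, the deployer's pod spec) can recompute the same name without shared state.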

Testing

make qualify — green (test + lint + e2e + scan).

New regression test:

  • TestEnsureRBACConcurrentRunsIndependent: runs EnsureRBAC twice with two different runIDs against the same namespace, then calls CleanupRBAC for run A only, and asserts that run B's SA and CRB are still present.

Existing RBAC tests updated to thread a runID through EnsureRBAC / CleanupRBAC.

Coverage (affected packages): no decrease — new function paths exercised by the same existing tests plus the new regression test.
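
The scenario the regression test covers can be modeled without a cluster. The sketch below is not the real test: `fakeCluster`, `ensureRBAC`, `cleanupRBAC`, the namespace "demo-ns", and the run IDs are all hypothetical stand-ins that track only which object names exist.

```go
package main

import "fmt"

// fakeCluster is a hypothetical in-memory stand-in for the API server.
type fakeCluster struct {
	serviceAccounts     map[string]bool // key: namespace/name
	clusterRoleBindings map[string]bool // key: name (cluster-scoped)
}

func newFakeCluster() *fakeCluster {
	return &fakeCluster{
		serviceAccounts:     map[string]bool{},
		clusterRoleBindings: map[string]bool{},
	}
}

// rbacName mirrors the per-run naming scheme, "aicr-validator-<runID>".
func rbacName(runID string) string { return "aicr-validator-" + runID }

// ensureRBAC records the run-scoped SA and CRB as created.
func ensureRBAC(c *fakeCluster, namespace, runID string) {
	c.serviceAccounts[namespace+"/"+rbacName(runID)] = true
	c.clusterRoleBindings[rbacName(runID)] = true
}

// cleanupRBAC removes only the objects that belong to the given run.
func cleanupRBAC(c *fakeCluster, namespace, runID string) {
	delete(c.serviceAccounts, namespace+"/"+rbacName(runID))
	delete(c.clusterRoleBindings, rbacName(runID))
}

func main() {
	c := newFakeCluster()
	ensureRBAC(c, "demo-ns", "run-a")
	ensureRBAC(c, "demo-ns", "run-b")

	// Run A finishes first and cleans up while run B is still in flight.
	cleanupRBAC(c, "demo-ns", "run-a")

	fmt.Println("run B SA present:", c.serviceAccounts["demo-ns/"+rbacName("run-b")])
	fmt.Println("run B CRB present:", c.clusterRoleBindings[rbacName("run-b")])
}
```

With a single fixed name, both runs would write and delete the same map keys, so run A's cleanup would remove run B's objects; that is the failure mode this PR removes.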

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: No CLI flags or APIs change. On-cluster resource names change (aicr-validator → aicr-validator-<runID>), so any external automation that looks up the SA/CRB by literal name (rather than by the app.kubernetes.io/name=aicr-validator label) needs an update. None known in-tree besides tests/e2e/run.sh, fixed here.
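
For automation that previously matched the literal name, the switch to label-based lookup amounts to filtering on the stable label instead of comparing names. A minimal sketch of that filter follows, using a hypothetical `object` type and made-up object names rather than a real Kubernetes client:

```go
package main

import (
	"fmt"
	"sort"
)

// nameLabel is the stable label key that this PR leaves unchanged.
const nameLabel = "app.kubernetes.io/name"

// object is a hypothetical, trimmed-down view of a Kubernetes object.
type object struct {
	Name   string
	Labels map[string]string
}

// selectByLabel returns the sorted names of the objects whose label key
// equals value.
func selectByLabel(objs []object, key, value string) []string {
	var names []string
	for _, o := range objs {
		if o.Labels[key] == value {
			names = append(names, o.Name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// Illustrative objects: two concurrent runs plus one unrelated account.
	objs := []object{
		{Name: "aicr-validator-run-b", Labels: map[string]string{nameLabel: "aicr-validator"}},
		{Name: "default"},
		{Name: "aicr-validator-run-a", Labels: map[string]string{nameLabel: "aicr-validator"}},
	}
	fmt.Println(selectByLabel(objs, nameLabel, "aicr-validator"))
}
```

A literal-name lookup for "aicr-validator" would find nothing in this list, while the label filter returns the objects of every in-flight run.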

Checklist

  • Tests pass locally (make test with -race)
  • make qualify passes
  • Docs updated (ADR-002 implementation note)
  • Cryptographically signed (-S)

@coderabbitai

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 091d0c4c-adc3-4a9c-8490-879c9094158f

📥 Commits

Reviewing files that changed from the base of the PR and between 3592ff3 and 589f43e.

📒 Files selected for processing (8)
  • docs/design/002-validatorv2-adr.md
  • pkg/cli/validate.go
  • pkg/validator/job/deployer.go
  • pkg/validator/job/deployer_test.go
  • pkg/validator/job/rbac.go
  • pkg/validator/job/rbac_test.go
  • pkg/validator/validator.go
  • tests/e2e/run.sh

Walkthrough

This PR makes RBAC resource naming run-scoped by suffixing ServiceAccount and ClusterRoleBinding names with a per-run identifier. It converts RBAC name constants into helper functions that accept runID, updates EnsureRBAC/CleanupRBAC to require runID, sets the deployer PodSpec to use the run-scoped ServiceAccount, updates validator lifecycle wiring to pass v.RunID, and revises tests, E2E scripts, CLI messaging, and the ADR to discover resources by the stable label app.kubernetes.io/name=aicr-validator instead of fixed object names.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

size/L, area/cli

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Title check: ✅ Passed. The title accurately summarizes the main change: fixing concurrent-run RBAC races by using per-run suffixed names.
  • Description check: ✅ Passed. The description is comprehensive and directly related to the changeset, explaining the motivation, implementation, and testing for the per-run RBAC naming change.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

The validator orchestrator's ServiceAccount and ClusterRoleBinding had
fixed names (aicr-validator). Two concurrent `aicr validate` runs
against the same namespace (or sharing the cluster, for the CRB)
would clobber each other: run A's end-of-run CleanupRBAC deleted the
SA while run B was still queuing validator Jobs, causing FailedCreate
loops until the per-check timeout (AICR_CHECK_TIMEOUT) fired.

Suffix the SA and CRB names with the per-run runID — the same pattern
already used for the snapshot/validation ConfigMaps. Each run holds
disjoint resources, and cleanup only deletes what that run created.

The `app.kubernetes.io/name=aicr-validator` label is unchanged, so
discovery tooling that matches by label keeps working. `tests/e2e/run.sh`
is updated to look up the SA/CRB by that label instead of literal name.

New regression test `TestEnsureRBACConcurrentRunsIndependent` creates
two runs against the same namespace, cleans up one, and asserts the
other's resources survive.
@yuanchen8911 yuanchen8911 force-pushed the fix/validator-rbac-per-run-names branch from 3592ff3 to 589f43e Compare May 14, 2026 18:32
@yuanchen8911 yuanchen8911 merged commit 02ac07c into NVIDIA:main May 14, 2026
35 of 36 checks passed
@yuanchen8911 yuanchen8911 deleted the fix/validator-rbac-per-run-names branch May 14, 2026 19:29
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Sweep of doc-quality issues uncovered by an audit of docs/, demos/,
and examples/ against current main. Fixes break into four buckets:

ERROR (breaking or contradicts code):
- examples/recipes/eks-training.yaml: deploymentOrder listed
  kube-prometheus-stack with no matching componentRefs entry — bundle
  generation rejected the recipe. Added the missing entry.
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same
  defect; performance phase referenced nonexistent checks
  (nccl-bandwidth-test, fabric-health-check) and an unknown
  `infrastructure: nccl-doctor` line — replaced with the real
  nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training
  overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and
  nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults.
- docs/contributor/validator.md: validators.Context struct field
  `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput
  *v1.ValidationInput`; gated skip example updated to match. Added an
  RBAC subsection documenting per-run aicr-validator-<runID> naming
  introduced in PR NVIDIA#888.
- docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to
  match RecipeCacheTTL (10 min); root endpoint `routes` list expanded
  from [/v1/recipe] to all three registered routes; Query parameter
  table gained the missing `platform` row; broken anchor
  index.md#cicd-architecture replaced with a working link.
- docs/contributor/cli.md: Factory interface snippet went from 4 to 5
  methods (CreateNodeTopologyCollector was missing); snapshot
  measurements list gained `topology`; bogus [INTERNAL]-wrapped
  invalid-accelerator example corrected to [INVALID_REQUEST];
  duplicate log line and empty section removed.
- docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT)
  to the validate exit-code table; `--best-effort` (a real flag) is no
  longer used as the "rejected typo" example.
- docs/user/api-reference.md: Bundle Components table gained the three
  missing components (nfd, slinky-slurm-operator,
  slinky-slurm-operator-crds) and was re-sorted alphabetically.
- docs/conformance/cncf/index.md: corrected script path
  pkg/evidence/scripts/collect-evidence.sh →
  pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the
  directory tree; section-name list updated to match the script's
  case block.

INCONSISTENCY / drift:
- platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm)
  in docs/README.md, docs/contributor/validations.md.
- OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux,
  talos) in docs/integrator/data-flow.md.
- Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in
  docs/integrator/{recipe-development,automation,data-flow}.md.
- demos/cuj1-eks.md: --system-node-selector dedicated=system-workload
  (taint key as a label selector) → nodeGroup=system-worker; matching
  prose adjustment.
- demos/cuj2.md: toleration grammar field [operation] → [effect].
- demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box
  borders re-aligned.
- demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per
  project artifact-location convention; stale "18 + 1 = 19" component
  count replaced with placeholders.
- demos/data.md: "Asymmetric rule matching" → "Dependency-driven
  ordering based on Kahn's algorithm".
- demos/README.md: index table gained the missing CUJ pages.
- docs/integrator/index.md and docs/README.md: integrator-pages tables
  gained missing rows.

STYLE / slug hygiene (CLAUDE.md doc-style rules):
- docs/user/cli-reference.md: stripped trailing (--flag)
  parentheticals from three headings (Storage Class, Deployment
  Methods, Value Overrides) — the Deploy/Undeploy variants are left
  alone in this PR because their slugs have many inbound references in
  bundle templates and golden-files which would need a coordinated
  follow-up.
- docs/contributor/api-server.md: three headings rewritten with
  "and" in place of "&" to fix double-hyphen slugs.
- docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)"
  from a heading.

TYPO / NIT:
- docs/README.md "a automated" → "an automated".
- docs/integrator/recipe-development.md "end to end" → "end-to-end".
- demos/cuj2.md "comma delimination" → "comma-separated".
- demos/valid.md sentence fragment completed.
- docs/user/component-catalog.md `***Note:***` → `**Note:**`.
- docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr
  → github.com/NVIDIA/aicr.
- docs/user/agent-deployment.md and docs/user/cli-reference.md
  literal `app.kubernetes.io/version: v0.17.0` → <aicr-version>
  placeholder.
- docs/contributor/cli.md CronJob example pin v0.6.4 →
  <release-tag> placeholder.
- examples/recipes/README.md driver version 580.82.07 → 580.105.08 to
  match current gpu-operator pin.
- examples/recipes/{eks-training,eks-gb200-...,kind}.yaml
  metadata.version v0.26.7-next → dev (hand-written-example
  convention shared with aks-training.yaml).
- demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log
  as a pre-NVIDIA#871 historical capture (kgateway → agentgateway
  migration).

Also adds the missing CNCF feature list to docs/user/cli-reference.md
--feature row and a "Scoping conformance to specific features" section
in docs/user/validation.md, since the 9 ValidFeatures names from
pkg/evidence/cncf/collector.go were not documented for users.

Skipped (out of scope or risk-deferred):
- ADR-002 cross-references to per-run RBAC names beyond the existing
  2026-05-14 implementation note — ADRs are frozen historical records.
- Deploy Script Behavior / Undeploy Script Behavior heading
  parentheticals — ~23 inbound anchor links in bundle templates and
  golden-files would need a coordinated PR.
- demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both
  files already explicitly note "No podTemplateOverrides /
  runtimePatches needed" since the torch-distributed runtime bakes
  scheduling at bundle time.
- pre-existing site/docs/* renders (gitignored, auto-generated).

Doc-only PR; full `make qualify` skipped per CLAUDE.local.md
("doc-only / infra-only change ... cheap checks are enough"). Ran
`make lint` (yamllint, gofmt, license-check, sidebar check, agents
sync, chart-pin verification) — all green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Sweep of doc-quality issues uncovered by an audit of docs/, demos/,
and examples/ against current main. Fixes break into four buckets:

ERROR (breaking or contradicts code):
- examples/recipes/eks-training.yaml: deploymentOrder listed
  kube-prometheus-stack with no matching componentRefs entry — bundle
  generation rejected the recipe. Added the missing entry.
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same
  defect; performance phase referenced nonexistent checks
  (nccl-bandwidth-test, fabric-health-check) and an unknown
  `infrastructure: nccl-doctor` line — replaced with the real
  nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training
  overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and
  nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults.
- docs/contributor/validator.md: validators.Context struct field
  `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput
  *v1.ValidationInput`; gated skip example updated to match. Added an
  RBAC subsection documenting per-run aicr-validator-<runID> naming
  introduced in PR NVIDIA#888.
- docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to
  match RecipeCacheTTL (10 min); root endpoint `routes` list expanded
  from [/v1/recipe] to all three registered routes; Query parameter
  table gained the missing `platform` row; broken anchor
  index.md#cicd-architecture replaced with a working link.
- docs/contributor/cli.md: Factory interface snippet went from 4 to 5
  methods (CreateNodeTopologyCollector was missing); snapshot
  measurements list gained `topology`; bogus [INTERNAL]-wrapped
  invalid-accelerator example corrected to [INVALID_REQUEST];
  duplicate log line and empty section removed.
- docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT)
  to the validate exit-code table; `--best-effort` (a real flag) is no
  longer used as the "rejected typo" example.
- docs/user/api-reference.md: Bundle Components table gained the three
  missing components (nfd, slinky-slurm-operator,
  slinky-slurm-operator-crds) and was re-sorted alphabetically.
- docs/conformance/cncf/index.md: corrected script path
  pkg/evidence/scripts/collect-evidence.sh →
  pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the
  directory tree; section-name list updated to match the script's
  case block.

INCONSISTENCY / drift:
- platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm)
  in docs/README.md, docs/contributor/validations.md.
- OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux,
  talos) in docs/integrator/data-flow.md.
- Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in
  docs/integrator/{recipe-development,automation,data-flow}.md.
- demos/cuj1-eks.md: --system-node-selector dedicated=system-workload
  (taint key as a label selector) → nodeGroup=system-worker; matching
  prose adjustment.
- demos/cuj2.md: toleration grammar field [operation] → [effect].
- demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box
  borders re-aligned.
- demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per
  project artifact-location convention; stale "18 + 1 = 19" component
  count replaced with placeholders.
- demos/data.md: "Asymmetric rule matching" → "Dependency-driven
  ordering based on Kahn's algorithm".
- demos/README.md: index table gained the missing CUJ pages.
- docs/integrator/index.md and docs/README.md: integrator-pages tables
  gained missing rows.

STYLE / slug hygiene (CLAUDE.md doc-style rules):
- docs/user/cli-reference.md: stripped trailing (--flag)
  parentheticals from three headings (Storage Class, Deployment
  Methods, Value Overrides) — the Deploy/Undeploy variants are left
  alone in this PR because their slugs have many inbound references in
  bundle templates and golden-files which would need a coordinated
  follow-up.
- docs/contributor/api-server.md: three headings rewritten with
  "and" in place of "&" to fix double-hyphen slugs.
- docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)"
  from a heading.

TYPO / NIT:
- docs/README.md "a automated" → "an automated".
- docs/integrator/recipe-development.md "end to end" → "end-to-end".
- demos/cuj2.md "comma delimination" → "comma-separated".
- demos/valid.md sentence fragment completed.
- docs/user/component-catalog.md `***Note:***` → `**Note:**`.
- docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr
  → github.com/NVIDIA/aicr.
- docs/user/agent-deployment.md and docs/user/cli-reference.md
  literal `app.kubernetes.io/version: v0.17.0` → <aicr-version>
  placeholder.
- docs/contributor/cli.md CronJob example pin v0.6.4 →
  <release-tag> placeholder.
- examples/recipes/README.md driver version 580.82.07 → 580.105.08 to
  match current gpu-operator pin.
- examples/recipes/{eks-training,eks-gb200-...,kind}.yaml
  metadata.version v0.26.7-next → dev (hand-written-example
  convention shared with aks-training.yaml).
- demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log
  as a pre-NVIDIA#871 historical capture (kgateway → agentgateway
  migration).

Also adds the missing CNCF feature list to docs/user/cli-reference.md
--feature row and a "Scoping conformance to specific features" section
in docs/user/validation.md, since the 9 ValidFeatures names from
pkg/evidence/cncf/collector.go were not documented for users.

Skipped (out of scope or risk-deferred):
- ADR-002 cross-references to per-run RBAC names beyond the existing
  2026-05-14 implementation note — ADRs are frozen historical records.
- Deploy Script Behavior / Undeploy Script Behavior heading
  parentheticals — ~23 inbound anchor links in bundle templates and
  golden-files would need a coordinated PR.
- demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both
  files already explicitly note "No podTemplateOverrides /
  runtimePatches needed" since the torch-distributed runtime bakes
  scheduling at bundle time.
- pre-existing site/docs/* renders (gitignored, auto-generated).

Doc-only PR; full `make qualify` skipped per CLAUDE.local.md
("doc-only / infra-only change ... cheap checks are enough"). Ran
`make lint` (yamllint, gofmt, license-check, sidebar check, agents
sync, chart-pin verification) — all green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Sweep of doc-quality issues uncovered by an audit of docs/, demos/,
and examples/ against current main. Fixes break into four buckets:

ERROR (breaking or contradicts code):
- examples/recipes/eks-training.yaml: deploymentOrder listed
  kube-prometheus-stack with no matching componentRefs entry — bundle
  generation rejected the recipe. Added the missing entry.
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same
  defect; performance phase referenced nonexistent checks
  (nccl-bandwidth-test, fabric-health-check) and an unknown
  `infrastructure: nccl-doctor` line — replaced with the real
  nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training
  overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and
  nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults.
- docs/contributor/validator.md: validators.Context struct field
  `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput
  *v1.ValidationInput`; gated skip example updated to match. Added an
  RBAC subsection documenting per-run aicr-validator-<runID> naming
  introduced in PR NVIDIA#888.
- docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to
  match RecipeCacheTTL (10 min); root endpoint `routes` list expanded
  from [/v1/recipe] to all three registered routes; Query parameter
  table gained the missing `platform` row; broken anchor
  index.md#cicd-architecture replaced with a working link.
- docs/contributor/cli.md: Factory interface snippet went from 4 to 5
  methods (CreateNodeTopologyCollector was missing); snapshot
  measurements list gained `topology`; bogus [INTERNAL]-wrapped
  invalid-accelerator example corrected to [INVALID_REQUEST];
  duplicate log line and empty section removed.
- docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT)
  to the validate exit-code table; `--best-effort` (a real flag) is no
  longer used as the "rejected typo" example.
- docs/user/api-reference.md: Bundle Components table gained the three
  missing components (nfd, slinky-slurm-operator,
  slinky-slurm-operator-crds) and was re-sorted alphabetically.
- docs/conformance/cncf/index.md: corrected script path
  pkg/evidence/scripts/collect-evidence.sh →
  pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the
  directory tree; section-name list updated to match the script's
  case block.

INCONSISTENCY / drift:
- platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm)
  in docs/README.md, docs/contributor/validations.md.
- OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux,
  talos) in docs/integrator/data-flow.md.
- Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in
  docs/integrator/{recipe-development,automation,data-flow}.md.
- demos/cuj1-eks.md: --system-node-selector dedicated=system-workload
  (taint key as a label selector) → nodeGroup=system-worker; matching
  prose adjustment.
- demos/cuj2.md: toleration grammar field [operation] → [effect].
- demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box
  borders re-aligned.
- demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per
  project artifact-location convention; stale "18 + 1 = 19" component
  count replaced with placeholders.
- demos/data.md: "Asymmetric rule matching" → "Dependency-driven
  ordering based on Kahn's algorithm".
- demos/README.md: index table gained the missing CUJ pages.
- docs/integrator/index.md and docs/README.md: integrator-pages tables
  gained missing rows.

STYLE / slug hygiene (CLAUDE.md doc-style rules):
- docs/user/cli-reference.md: stripped trailing (--flag)
  parentheticals from three headings (Storage Class, Deployment
  Methods, Value Overrides) — the Deploy/Undeploy variants are left
  alone in this PR because their slugs have many inbound references in
  bundle templates and golden-files which would need a coordinated
  follow-up.
- docs/contributor/api-server.md: three headings rewritten with
  "and" in place of "&" to fix double-hyphen slugs.
- docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)"
  from a heading.

TYPO / NIT:
- docs/README.md "a automated" → "an automated".
- docs/integrator/recipe-development.md "end to end" → "end-to-end".
- demos/cuj2.md "comma delimination" → "comma-separated".
- demos/valid.md sentence fragment completed.
- docs/user/component-catalog.md `***Note:***` → `**Note:**`.
- docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr
  → github.com/NVIDIA/aicr.
- docs/user/agent-deployment.md and docs/user/cli-reference.md
  literal `app.kubernetes.io/version: v0.17.0` → <aicr-version>
  placeholder.
- docs/contributor/cli.md CronJob example pin v0.6.4 →
  <release-tag> placeholder.
- examples/recipes/README.md driver version 580.82.07 → 580.105.08 to
  match current gpu-operator pin.
- examples/recipes/{eks-training,eks-gb200-...,kind}.yaml
  metadata.version v0.26.7-next → dev (hand-written-example
  convention shared with aks-training.yaml).
- demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log
  as a pre-NVIDIA#871 historical capture (kgateway → agentgateway
  migration).

Also adds the missing CNCF feature list to docs/user/cli-reference.md
--feature row and a "Scoping conformance to specific features" section
in docs/user/validation.md, since the 9 ValidFeatures names from
pkg/evidence/cncf/collector.go were not documented for users.

Skipped (out of scope or risk-deferred):
- ADR-002 cross-references to per-run RBAC names beyond the existing
  2026-05-14 implementation note — ADRs are frozen historical records.
- Deploy Script Behavior / Undeploy Script Behavior heading
  parentheticals — ~23 inbound anchor links in bundle templates and
  golden-files would need a coordinated PR.
- demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both
  files already explicitly note "No podTemplateOverrides /
  runtimePatches needed" since the torch-distributed runtime bakes
  scheduling at bundle time.
- pre-existing site/docs/* renders (gitignored, auto-generated).

Doc-only PR; full `make qualify` skipped per CLAUDE.local.md
("doc-only / infra-only change ... cheap checks are enough"). Ran
`make lint` (yamllint, gofmt, license-check, sidebar check, agents
sync, chart-pin verification) — all green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 14, 2026
Sweep of doc-quality issues uncovered by an audit of docs/, demos/,
and examples/ against current main. Fixes break into four buckets:

ERROR (breaking or contradicts code):
- examples/recipes/eks-training.yaml: deploymentOrder listed
  kube-prometheus-stack with no matching componentRefs entry — bundle
  generation rejected the recipe. Added the missing entry.
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same
  defect; performance phase referenced nonexistent checks
  (nccl-bandwidth-test, fabric-health-check) and an unknown
  `infrastructure: nccl-doctor` line — replaced with the real
  nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training
  overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and
  nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults.
- docs/contributor/validator.md: validators.Context struct field
  `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput
  *v1.ValidationInput`; gated skip example updated to match. Added an
  RBAC subsection documenting per-run aicr-validator-<runID> naming
  introduced in PR NVIDIA#888.
- docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to
  match RecipeCacheTTL (10 min); root endpoint `routes` list expanded
  from [/v1/recipe] to all three registered routes; Query parameter
  table gained the missing `platform` row; broken anchor
  index.md#cicd-architecture replaced with a working link.
- docs/contributor/cli.md: Factory interface snippet went from 4 to 5
  methods (CreateNodeTopologyCollector was missing); snapshot
  measurements list gained `topology`; bogus [INTERNAL]-wrapped
  invalid-accelerator example corrected to [INVALID_REQUEST];
  duplicate log line and empty section removed.
- docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT)
  to the validate exit-code table; `--best-effort` (a real flag) is no
  longer used as the "rejected typo" example.
- docs/user/api-reference.md: Bundle Components table gained the three
  missing components (nfd, slinky-slurm-operator,
  slinky-slurm-operator-crds) and was re-sorted alphabetically.
- docs/conformance/cncf/index.md: corrected script path
  pkg/evidence/scripts/collect-evidence.sh →
  pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the
  directory tree; section-name list updated to match the script's
  case block.

INCONSISTENCY / drift:
- platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm)
  in docs/README.md, docs/contributor/validations.md.
- OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux,
  talos) in docs/integrator/data-flow.md.
- Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in
  docs/integrator/{recipe-development,automation,data-flow}.md.
- demos/cuj1-eks.md: --system-node-selector dedicated=system-workload
  (taint key as a label selector) → nodeGroup=system-worker; matching
  prose adjustment.
- demos/cuj2.md: toleration grammar field [operation] → [effect].
- demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box
  borders re-aligned.
- demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per
  project artifact-location convention; stale "18 + 1 = 19" component
  count replaced with placeholders.
- demos/data.md: "Asymmetric rule matching" → "Dependency-driven
  ordering based on Kahn's algorithm".
- demos/README.md: index table gained the missing CUJ pages.
- docs/integrator/index.md and docs/README.md: integrator-pages tables
  gained missing rows.

STYLE / slug hygiene (CLAUDE.md doc-style rules):
- docs/user/cli-reference.md: stripped trailing (--flag)
  parentheticals from three headings (Storage Class, Deployment
  Methods, Value Overrides) — the Deploy/Undeploy variants are left
  alone in this PR because their slugs have many inbound references in
  bundle templates and golden-files which would need a coordinated
  follow-up.
- docs/contributor/api-server.md: three headings rewritten with
  "and" in place of "&" to fix double-hyphen slugs.
- docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)"
  from a heading.

TYPO / NIT:
- docs/README.md "a automated" → "an automated".
- docs/integrator/recipe-development.md "end to end" → "end-to-end".
- demos/cuj2.md "comma delimination" → "comma-separated".
- demos/valid.md sentence fragment completed.
- docs/user/component-catalog.md `***Note:***` → `**Note:**`.
- docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr
  → github.com/NVIDIA/aicr.
- docs/user/agent-deployment.md and docs/user/cli-reference.md
  literal `app.kubernetes.io/version: v0.17.0` → <aicr-version>
  placeholder.
- docs/contributor/cli.md CronJob example pin v0.6.4 →
  <release-tag> placeholder.
- examples/recipes/README.md driver version 580.82.07 → 580.105.08 to
  match current gpu-operator pin.
- examples/recipes/{eks-training,eks-gb200-...,kind}.yaml
  metadata.version v0.26.7-next → dev (hand-written-example
  convention shared with aks-training.yaml).
- demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log
  as a pre-NVIDIA#871 historical capture (kgateway → agentgateway
  migration).

Also adds the missing CNCF feature list to docs/user/cli-reference.md
--feature row and a "Scoping conformance to specific features" section
in docs/user/validation.md, since the 9 ValidFeatures names from
pkg/evidence/cncf/collector.go were not documented for users.

Skipped (out of scope or risk-deferred):
- ADR-002 cross-references to per-run RBAC names beyond the existing
  2026-05-14 implementation note — ADRs are frozen historical records.
- Deploy Script Behavior / Undeploy Script Behavior heading
  parentheticals — ~23 inbound anchor links in bundle templates and
  golden-files would need a coordinated PR.
- demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both
  files already explicitly note "No podTemplateOverrides /
  runtimePatches needed" since the torch-distributed runtime bakes
  scheduling at bundle time.
- pre-existing site/docs/* renders (gitignored, auto-generated).
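
For readers unfamiliar with the per-run RBAC names mentioned in the ADR-002 item, the following is a minimal sketch of the naming scheme as the #888 description states it: `ServiceAccountName` and `ClusterRoleBindingName` are functions of the run ID returning `aicr-validator-<runID>`. The `rbacBaseName` constant and the sample run IDs are assumptions for illustration, and the real code in pkg/validator may be organized differently.

```go
package main

import "fmt"

// rbacBaseName is a hypothetical constant holding the fixed prefix that the
// PR description gives for both RBAC objects.
const rbacBaseName = "aicr-validator"

// ServiceAccountName returns the per-run ServiceAccount name,
// "aicr-validator-<runID>".
func ServiceAccountName(runID string) string {
	return rbacBaseName + "-" + runID
}

// ClusterRoleBindingName returns the per-run ClusterRoleBinding name,
// "aicr-validator-<runID>".
func ClusterRoleBindingName(runID string) string {
	return rbacBaseName + "-" + runID
}

func main() {
	// Two made-up run IDs standing in for overlapping validate invocations.
	runA, runB := "run-a", "run-b"

	fmt.Println(ServiceAccountName(runA))
	fmt.Println(ServiceAccountName(runB))

	// Distinct names mean cleanup for one run cannot delete the other's RBAC.
	fmt.Println(ClusterRoleBindingName(runA) != ClusterRoleBindingName(runB))
}
```

Because each run derives its own names, end-of-run cleanup only ever targets objects that carry that run's ID.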

Doc-only PR; full `make qualify` skipped per CLAUDE.local.md
("doc-only / infra-only change ... cheap checks are enough"). Ran
`make lint` (yamllint, gofmt, license-check, sidebar check, agents
sync, chart-pin verification) — all green.