docs: fix audit findings across docs, demos, examples #896
🌿 Preview your docs: https://nvidia-preview-fix-docs-audit-2026-05-14.docs.buildwithfern.com/aicr
📝 Walkthrough
This PR updates documentation and example recipes across the repository: expands supported platform enum values, updates API/doc references and Cache-Control examples, adds NodeTopology to CLI snapshot docs and a factory method, documents validator context and per-run RBAC naming, bumps GPU operator and related chart versions in examples, adds kube-prometheus-stack to recipes, adjusts demos (placeholders, TMPDIR, toleration/selector text), standardizes GitHub/version placeholders, and reorganizes CNCF evidence collector paths.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@demos/e2e.md`:
- Line 51: The example uses a variable-backed criteria path in one command but a
hardcoded /tmp path later; update the later command that references
/tmp/criteria.yaml to use the same "${TMPDIR:-/tmp}/criteria.yaml" form so the
criteria-file path is consistent across the flow (i.e., replace
/tmp/criteria.yaml with "${TMPDIR:-/tmp}/criteria.yaml" in the subsequent aicr
command/example).
In `@docs/conformance/cncf/index.md`:
- Around line 84-90: The fenced code block that starts with ```bash immediately
follows the paragraph "Alternatively, run the evidence collection script
directly." which triggers markdownlint MD031; fix it by inserting a blank line
before the fenced code block (and keep the existing blank line after) so the
block is separated from the preceding paragraph—edit the fenced code block in
the cncf index content (the triple-backtick fence containing the
collect-evidence.sh examples) to ensure a blank line precedes it.
In `@docs/contributor/api-server.md`:
- Line 1076: The sentence saying "The `platform` criteria field has no allowlist
env var today; all platform values are accepted" is misleading because enum
validation prevents arbitrary strings; update the documentation text for the
`platform` criteria field to state that there is currently no environment-based
allowlist and that only the predefined enum values (the allowed platform enum)
are accepted—clarify the expected failure mode when an invalid enum value is
provided (validation error) and remove wording that implies arbitrary strings
are permitted.
In `@docs/contributor/cli.md`:
- Line 307: The snapshot example uses the non-canonical measurement name
"topology"; update the example string so the measurements array uses the
canonical type name "NodeTopology" instead of "topology" (the example line
containing E --> F["Snapshot Structure...measurements: [k8s, systemd, os, gpu,
topology]"] should be changed to use NodeTopology) so docs match the code-facing
contract.
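The `api-server.md` comment above separates two ideas: an environment-based allowlist (which `platform` does not have) and enum validation (which still applies). A minimal Go sketch of the second idea follows. It assumes the four platform values named elsewhere in this PR (dynamo, kubeflow, nim, slurm); the `validPlatforms` variable, the `parsePlatform` helper, the error wording, and the case-sensitive match are all hypothetical and not taken from the AICR source.

```go
package main

import (
	"fmt"
	"strings"
)

// validPlatforms mirrors the enum values listed in this PR.
// The variable name is hypothetical.
var validPlatforms = []string{"dynamo", "kubeflow", "nim", "slurm"}

// parsePlatform is a hypothetical helper: it accepts only the enum values
// above (case-sensitive, an assumption) and returns a validation error
// for anything else.
func parsePlatform(s string) (string, error) {
	for _, p := range validPlatforms {
		if s == p {
			return p, nil
		}
	}
	return "", fmt.Errorf("invalid platform %q: must be one of %s",
		s, strings.Join(validPlatforms, ", "))
}

func main() {
	for _, in := range []string{"slurm", "my-platform"} {
		if p, err := parsePlatform(in); err != nil {
			fmt.Println("rejected:", err)
		} else {
			fmt.Println("accepted:", p)
		}
	}
}
```

The point for the docs wording is that "no allowlist env var" and "any string is accepted" are different statements: the enum check rejects `my-platform` even with no allowlist configured.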
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: a9215fcd-3b87-451a-9479-fb6abe4400a2
📒 Files selected for processing (31)
- README.md
- demos/README.md
- demos/cuj1-eks.md
- demos/cuj2-demo.md
- demos/cuj2.md
- demos/data.md
- demos/e2e.md
- demos/examples/CUJ2-Test-Report.md
- demos/valid.md
- docs/README.md
- docs/conformance/cncf/index.md
- docs/contributor/api-server.md
- docs/contributor/cli.md
- docs/contributor/index.md
- docs/contributor/validations.md
- docs/contributor/validator.md
- docs/integrator/aks-gpu-setup.md
- docs/integrator/automation.md
- docs/integrator/data-flow.md
- docs/integrator/index.md
- docs/integrator/recipe-development.md
- docs/user/agent-deployment.md
- docs/user/api-reference.md
- docs/user/cli-reference.md
- docs/user/component-catalog.md
- docs/user/installation.md
- docs/user/validation.md
- examples/recipes/README.md
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml
- examples/recipes/eks-training.yaml
- examples/recipes/kind.yaml
💤 Files with no reviewable changes (1)
- README.md
2295c97 to 3d8cefa
Actionable comments posted: 2
Inline comments:
In `@demos/cuj1-eks.md`:
- Around line 50-52: The selector/toleration values are mismatched:
--system-node-selector uses nodeGroup=system-worker while the
--system-node-toleration entries reference dedicated=system-workload; either
make this explicit or align them. Fix by changing the toleration keys/values to
match the selector (e.g., use dedicated=system-worker) or update the selector to
nodeGroup=system-workload, or add a short comment clarifying the intentional
difference; update the occurrences of --system-node-selector and
--system-node-toleration so the selector label and taint key/value correspond.
In `@docs/contributor/cli.md`:
- Line 307: The diagram's "Detailed Data Flow" block is missing the NodeTopology
collection step and thus is inconsistent with the measurements list; update the
flow diagram to add a sixth collection goroutine for NodeTopology (so the
collection goroutines show metadata, Kubernetes, SystemD, OS, GPU, NodeTopology)
and add the NodeTopology label to the relevant diagram node(s) (the node that
currently lists measurements and the block that enumerates the collection
goroutines) so the diagram matches the measurements array and the narrative;
ensure any references to five collectors (e.g., the goroutine list) are changed
to include NodeTopology.
3d8cefa to 22cfc2d
♻️ Duplicate comments (1)
docs/contributor/cli.md (1)
299-305: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Update the Detailed Data Flow diagram to include NodeTopology collection goroutine.
The snapshot measurements list (lines 307, 315) now includes NodeTopology, but the "Detailed Data Flow" diagram still shows only five collection goroutines (Metadata, Kubernetes, SystemD, OS, GPU). Add a sixth goroutine node for NodeTopology collection to keep the diagram consistent with the documented measurements.
As per coding guidelines: "Document system design, architecture decisions, and component interactions in contributor documentation."
Duplicate comments:
In `@docs/contributor/cli.md`:
- Around line 299-305: The diagram is missing the NodeTopology collection
goroutine node; add a sixth node (e.g., D6) labeled like "Go Routine 6:
NodeTopology<br/>• NodeTopology" (or similar descriptive bullet items) and
update the final merge arrow so D1, D2, D3, D4, D5, and D6 all point to E["All
goroutines complete<br/>or first error returns"] to keep the "Detailed Data
Flow" diagram consistent with the snapshot measurements that include
NodeTopology.
22cfc2d to d3d4da3
Actionable comments posted: 1
Inline comments:
In `@docs/contributor/cli.md`:
- Line 309: Update the Snapshot Structure example's measurements list to use the
canonical measurement type names; locate the "Snapshot Structure" line (the
diagram node labeled E --> F["Snapshot Structure..."] showing measurements:
[k8s, systemd, os, gpu, NodeTopology]) and change the entries to the documented
canonical casing: [K8s, SystemD, OS, GPU, NodeTopology] so examples match the
documented types.
d3d4da3 to 3e0a231
@yuanchen8911 this PR now has merge conflicts with |
Sweep of doc-quality issues uncovered by an audit of docs/, demos/, and examples/ against current main. Fixes break into four buckets:

ERROR (breaking or contradicts code):
- examples/recipes/eks-training.yaml: deploymentOrder listed kube-prometheus-stack with no matching componentRefs entry, so bundle generation rejected the recipe. Added the missing entry.
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same defect; performance phase referenced nonexistent checks (nccl-bandwidth-test, fabric-health-check) and an unknown `infrastructure: nccl-doctor` line, replaced with the real nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults.
- docs/contributor/validator.md: validators.Context struct field `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput *v1.ValidationInput`; gated skip example updated to match. Added an RBAC subsection documenting per-run aicr-validator-<runID> naming introduced in PR NVIDIA#888.
- docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to match RecipeCacheTTL (10 min); root endpoint `routes` list expanded from [/v1/recipe] to all three registered routes; Query parameter table gained the missing `platform` row; broken anchor index.md#cicd-architecture replaced with a working link.
- docs/contributor/cli.md: Factory interface snippet went from 4 to 5 methods (CreateNodeTopologyCollector was missing); snapshot measurements list gained `topology`; bogus [INTERNAL]-wrapped invalid-accelerator example corrected to [INVALID_REQUEST]; duplicate log line and empty section removed.
- docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT) to the validate exit-code table; `--best-effort` (a real flag) is no longer used as the "rejected typo" example.
- docs/user/api-reference.md: Bundle Components table gained the three missing components (nfd, slinky-slurm-operator, slinky-slurm-operator-crds) and was re-sorted alphabetically.
- docs/conformance/cncf/index.md: corrected script path pkg/evidence/scripts/collect-evidence.sh → pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the directory tree; section-name list updated to match the script's case block.

INCONSISTENCY / drift:
- platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm) in docs/README.md, docs/contributor/validations.md.
- OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux, talos) in docs/integrator/data-flow.md.
- Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in docs/integrator/{recipe-development,automation,data-flow}.md.
- demos/cuj1-eks.md: --system-node-selector dedicated=system-workload (taint key as a label selector) → nodeGroup=system-worker; matching prose adjustment.
- demos/cuj2.md: toleration grammar field [operation] → [effect].
- demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box borders re-aligned.
- demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per project artifact-location convention; stale "18 + 1 = 19" component count replaced with placeholders.
- demos/data.md: "Asymmetric rule matching" → "Dependency-driven ordering based on Kahn's algorithm".
- demos/README.md: index table gained the missing CUJ pages.
- docs/integrator/index.md and docs/README.md: integrator-pages tables gained missing rows.

STYLE / slug hygiene (CLAUDE.md doc-style rules):
- docs/user/cli-reference.md: stripped trailing (--flag) parentheticals from three headings (Storage Class, Deployment Methods, Value Overrides). The Deploy/Undeploy variants are left alone in this PR because their slugs have many inbound references in bundle templates and golden-files, which would need a coordinated follow-up.
- docs/contributor/api-server.md: three headings rewritten with "and" in place of "&" to fix double-hyphen slugs.
- docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)" from a heading.

TYPO / NIT:
- docs/README.md "a automated" → "an automated".
- docs/integrator/recipe-development.md "end to end" → "end-to-end".
- demos/cuj2.md "comma delimination" → "comma-separated".
- demos/valid.md sentence fragment completed.
- docs/user/component-catalog.md `***Note:***` → `**Note:**`.
- docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr → github.com/NVIDIA/aicr.
- docs/user/agent-deployment.md and docs/user/cli-reference.md literal `app.kubernetes.io/version: v0.17.0` → <aicr-version> placeholder.
- docs/contributor/cli.md CronJob example pin v0.6.4 → <release-tag> placeholder.
- examples/recipes/README.md driver version 580.82.07 → 580.105.08 to match current gpu-operator pin.
- examples/recipes/{eks-training,eks-gb200-...,kind}.yaml metadata.version v0.26.7-next → dev (hand-written-example convention shared with aks-training.yaml).
- demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log as a pre-NVIDIA#871 historical capture (kgateway → agentgateway migration).

Also adds the missing CNCF feature list to docs/user/cli-reference.md --feature row and a "Scoping conformance to specific features" section in docs/user/validation.md, since the 9 ValidFeatures names from pkg/evidence/cncf/collector.go were not documented for users.

Skipped (out of scope or risk-deferred):
- ADR-002 cross-references to per-run RBAC names beyond the existing 2026-05-14 implementation note: ADRs are frozen historical records.
- Deploy Script Behavior / Undeploy Script Behavior heading parentheticals: ~23 inbound anchor links in bundle templates and golden-files would need a coordinated PR.
- demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup: both files already explicitly note "No podTemplateOverrides / runtimePatches needed" since the torch-distributed runtime bakes scheduling at bundle time.
- pre-existing site/docs/* renders (gitignored, auto-generated).

Doc-only PR; full `make qualify` skipped per CLAUDE.local.md ("doc-only / infra-only change ... cheap checks are enough"). Ran `make lint` (yamllint, gofmt, license-check, sidebar check, agents sync, chart-pin verification); all green.
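The commit message corrects the `demos/data.md` description to "Dependency-driven ordering based on Kahn's algorithm". For readers unfamiliar with it, here is a minimal, self-contained Go sketch of Kahn's algorithm. The function name `kahnOrder`, the `deps` map shape, the alphabetical tie-break, and the example edges between components are all illustrative assumptions; none of them is taken from the AICR implementation or its registry.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// kahnOrder returns an order in which every node appears after all of its
// dependencies. deps[x] lists the nodes that must come before x.
// Ties are broken alphabetically so the result is deterministic.
func kahnOrder(deps map[string][]string) ([]string, error) {
	indegree := make(map[string]int)
	dependents := make(map[string][]string)
	for node, ds := range deps {
		if _, ok := indegree[node]; !ok {
			indegree[node] = 0
		}
		for _, d := range ds {
			if _, ok := indegree[d]; !ok {
				indegree[d] = 0
			}
			indegree[node]++
			dependents[d] = append(dependents[d], node)
		}
	}

	// Start with every node that has no unmet dependencies.
	var ready []string
	for n, deg := range indegree {
		if deg == 0 {
			ready = append(ready, n)
		}
	}
	sort.Strings(ready)

	var order []string
	for len(ready) > 0 {
		n := ready[0]
		ready = ready[1:]
		order = append(order, n)
		// Releasing n may unblock the nodes that depend on it.
		for _, m := range dependents[n] {
			indegree[m]--
			if indegree[m] == 0 {
				ready = append(ready, m)
			}
		}
		sort.Strings(ready)
	}

	// Nodes left with a non-zero indegree are part of a cycle.
	if len(order) != len(indegree) {
		return nil, errors.New("dependency cycle detected")
	}
	return order, nil
}

func main() {
	// Hypothetical edges, for illustration only.
	deps := map[string][]string{
		"gpu-operator":          {"nfd"},
		"nvidia-dra-driver-gpu": {"gpu-operator"},
		"kube-prometheus-stack": {},
	}
	order, err := kahnOrder(deps)
	fmt.Println(order, err)
}
```

Unlike rule matching, the output here depends only on the dependency edges, which is the distinction the corrected wording in `demos/data.md` draws.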
3e0a231 to 50e6eb3
…g note

Follow-up to PR NVIDIA#866 (Slinky slurm-operator). Backfills platform-enum drift on surfaces the original PR did not catch, plus surfaces the chart v1.1.0 nodeSelector silent-ignore limitation on the user-facing component-catalog row.

PR NVIDIA#896 (yuanchen8911) already fixed docs/README.md and docs/contributor/validations.md as part of a broader docs audit. PR NVIDIA#893 removed the site/docs/ vitepress mirror entirely.

Remaining surfaces this PR addresses:
- docs/contributor/data.md:130: platform row in criteria table
- docs/user/component-catalog.md:36: slinky-slurm-operator row, appended **Known limitation:** clause documenting chart v1.1.0 silent-ignore of operator.nodeSelector/webhook.nodeSelector (the same limitation is already inline in recipes/registry.yaml; this surfaces it on the rendered user catalog)
- pkg/api/doc.go:72: REST API godoc was missing slurm entirely
- pkg/recipe/doc.go:32: struct shape comment reordered to alphabetical to match GetCriteriaPlatformTypes()
- pkg/recipe/doc.go:93-98: CriteriaPlatform* constants reordered alphabetical
- pkg/recipe/criteria.go:246: Platform-field godoc updated
- .claude/skills/analyzing-snapshots/SKILL.md:278: internal AICR snapshot-analysis skill criteria table

All now list 'dynamo, kubeflow, nim, slurm' alphabetically matching pkg/recipe.GetCriteriaPlatformTypes() (the authoritative Go enum).

New guard test
==============
pkg/recipe/doc_test.go::TestCriteriaPlatformConstantsMatchGetter asserts the CriteriaPlatform* constants and GetCriteriaPlatformTypes() stay in sync. Mechanically catches the exact class of drift this commit fixes if a future platform value is added to one but not the other.

Addressed reviews
=================
@mchmarny (NVIDIA#866): Doc-audit gap. cli-reference and api-reference were fixed at merge time and NVIDIA#896 picked up README/validations; this PR catches the remaining surfaces (contributor/data, the three Go godoc files, and the internal skill table).
@coderabbitai (NVIDIA#866): NodeSelector limitation note on the slinky-slurm-operator catalog row.
Internal PE + QA + DA panel review on draft NVIDIA#884: extended audit to pkg/api/doc.go (factual miss), pkg/recipe/doc.go (ordering), .claude/skills/analyzing-snapshots/SKILL.md, plus the guard test in pkg/recipe/doc_test.go.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
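The guard test described above is not shown in this conversation. As a rough illustration of the kind of check it performs, the Go sketch below compares a set of constants against a getter's return value and reports any value present in only one of them. Every identifier here (`platformDynamo` and friends, `getPlatformTypes`, `symmetricDiff`) is a hypothetical stand-in, not the real `pkg/recipe` code.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical stand-ins for the CriteriaPlatform* constants.
const (
	platformDynamo   = "dynamo"
	platformKubeflow = "kubeflow"
	platformNIM      = "nim"
	platformSlurm    = "slurm"
)

// getPlatformTypes is a hypothetical stand-in for GetCriteriaPlatformTypes().
func getPlatformTypes() []string {
	return []string{"dynamo", "kubeflow", "nim", "slurm"}
}

// symmetricDiff returns, sorted, the values that appear in exactly one of
// the two slices. An empty result means the two lists are in sync.
func symmetricDiff(a, b []string) []string {
	seen := map[string]int{}
	for _, v := range a {
		seen[v] |= 1 // bit 0: present in a
	}
	for _, v := range b {
		seen[v] |= 2 // bit 1: present in b
	}
	var diff []string
	for v, bits := range seen {
		if bits != 3 {
			diff = append(diff, v)
		}
	}
	sort.Strings(diff)
	return diff
}

func main() {
	constants := []string{platformDynamo, platformKubeflow, platformNIM, platformSlurm}
	if d := symmetricDiff(constants, getPlatformTypes()); len(d) > 0 {
		fmt.Println("drift detected:", d)
		return
	}
	fmt.Println("constants and getter are in sync")
}
```

In a real test file the same comparison would presumably sit inside a `testing.T` function and fail the test when the difference is non-empty; note that such a check still depends on someone keeping the constant list in the test up to date.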
Summary
Sweep of doc-quality issues uncovered by an audit of `docs/`, `demos/`, and `examples/` against current `main`. 31 files changed (+208/-105) after two rounds of review feedback. Doc-only.
Motivation / Context
An audit run across `docs/user/`, `docs/contributor/`, `docs/integrator/`, `docs/conformance/`, `docs/design/`, `demos/`, and `examples/` surfaced a mix of: example recipes that fail bundle generation, code drift in contributor docs (struct fields, interface methods, env vars, cache TTLs, endpoint lists), enum lists that omit values, stale version examples, slug-breaking heading shapes, and a handful of typos. Each finding was verified against current source before editing.
Fixes: N/A
Related: bundles on top of #888 (per-run RBAC), #889 (CNCF feature=all timeout), #871 (kgateway → agentgateway).
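The "example recipes that fail bundle generation" finding is, per the commit message above, a `deploymentOrder` entry (`kube-prometheus-stack`) with no matching `componentRefs` entry. The Go sketch below shows that kind of cross-list consistency check in isolation. The `recipe` struct is a deliberately trimmed, hypothetical shape and `missingComponentRefs` is an illustrative helper; neither reflects the actual AICR recipe schema or bundler code.

```go
package main

import "fmt"

// recipe is a hypothetical, trimmed-down shape used only for this sketch.
type recipe struct {
	ComponentRefs   []string // names of declared components
	DeploymentOrder []string // names in the order they should deploy
}

// missingComponentRefs returns the deploymentOrder entries that have no
// matching componentRefs entry, preserving deploymentOrder's ordering.
func missingComponentRefs(r recipe) []string {
	declared := make(map[string]bool, len(r.ComponentRefs))
	for _, name := range r.ComponentRefs {
		declared[name] = true
	}
	var missing []string
	for _, name := range r.DeploymentOrder {
		if !declared[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Mirrors the defect described in this PR: the order names a component
	// that was never declared.
	r := recipe{
		ComponentRefs:   []string{"gpu-operator"},
		DeploymentOrder: []string{"gpu-operator", "kube-prometheus-stack"},
	}
	fmt.Println(missingComponentRefs(r))
}
```

Running a check like this over hand-written example recipes before publishing them would catch the defect without needing a full bundle generation.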
Type of Change
Component(s) Affected
(`docs/`, `examples/`)
Implementation Notes
ERROR (breaking or contradicts current code):
- `examples/recipes/eks-training.yaml` + `eks-gb200-ubuntu-training-with-validation.yaml`: `deploymentOrder` listed `kube-prometheus-stack` with no `componentRefs` entry → bundle generation rejected. Added the entry.
- `eks-gb200-...with-validation.yaml`: performance phase referenced nonexistent checks (`nccl-bandwidth-test`, `fabric-health-check`) and unknown `infrastructure: nccl-doctor` → replaced with the real `nccl-all-reduce-bw-net` / `-nvls` from the gb200-eks-training overlay, with matching exact-named constraints (`>= 40` / `>= 500`) so the variant checks actually run instead of skipping. Bumped `gpu-operator` v25.3.3 → v26.3.1 and `nvidia-dra-driver-gpu` 25.8.1 → 25.12.0 to registry defaults; bumped the matching deployment-phase constraint and remediation text to keep the example self-consistent.
- `docs/contributor/validator.md`: `validators.Context` struct field `Recipe *recipe.RecipeResult` (removed) → `ValidationInput *v1.ValidationInput`; gated-skip example updated. Added an RBAC subsection documenting per-run `aicr-validator-<runID>` naming (PR #888, "fix(validator): per-run RBAC names to prevent concurrent-run races") and the `app.kubernetes.io/name=aicr-validator` label-selector pattern for cleanup tooling. NCCL performance-check table now shows the variant-specific constraint names (`nccl-all-reduce-bw-net` / `-nvls`) plus a callout on the `constraintNameForVariant` contract so recipe authors don't get silent Skips.
- `docs/contributor/api-server.md`: `Cache-Control: max-age=300` → `600` (matches `defaults.RecipeCacheTTL = 10m`); root endpoint `routes` list expanded from `[/v1/recipe]` to all three registered routes; Query parameter table gained the missing `platform` row; broken anchor `index.md#cicd-architecture` replaced with a working link; platform-allowlist wording clarified to "valid enum values" (not arbitrary strings).
- `docs/contributor/cli.md`: Factory interface snippet went from 4 to 5 methods (added `CreateNodeTopologyCollector`); snapshot measurements list and Mermaid flow diagram both gained `NodeTopology` (canonical type name + corresponding collector + parallel goroutine); bogus `[INTERNAL]`-wrapped invalid-accelerator example corrected, and now shows actual current output with `[INVALID_REQUEST]` wrapping and `exitCode=2` (verified by running the CLI); duplicate log line and empty section removed.
- `docs/user/cli-reference.md`: added missing exit code 5 (TIMEOUT) to the validate exit-code table; `--best-effort` (a real flag) is no longer misused as the "rejected typo" example.
- `docs/user/api-reference.md`: Bundle Components table gained the three missing components (`nfd`, `slinky-slurm-operator`, `slinky-slurm-operator-crds`) and was re-sorted alphabetically.
- `docs/user/validation.md`: new `--feature` section now correctly describes the flag as scoping the CNCF-submission evidence collector (requires `--cncf-submission` + `--evidence-dir`), not the conformance-phase validator run.
- `docs/conformance/cncf/index.md`: corrected script path to `pkg/evidence/cncf/scripts/collect-evidence.sh` and refreshed the directory tree; section-name list updated to match the script's case block; markdownlint MD031 blank-line fix around fenced code block.
INCONSISTENCY / drift:
- `platform` enum was `(kubeflow)` → full list `(dynamo, kubeflow, nim, slurm)` in `docs/README.md`, `docs/contributor/validations.md`.
- OS enum `(ubuntu, rhel)` → full list `(ubuntu, rhel, cos, amazonlinux, talos)` in `docs/integrator/data-flow.md`.
- Stale `gpu-operator` illustrative pins (v25.3.x) bumped to `v26.3.1` in `docs/integrator/{recipe-development,automation,data-flow}.md`.
- `demos/cuj1-eks.md`: `--system-node-selector dedicated=system-workload` (a taint key as a label selector) → `nodeGroup=system-worker`; matching prose fix; added a clarifying paragraph noting that the AICR reference clusters intentionally use distinct selector vs. toleration keys.
- `demos/cuj2.md` toleration grammar field `[operation]` → `[effect]`.
- `demos/cuj2-demo.md` `gpu-operator` pin updated; ASCII box borders re-aligned.
- `demos/e2e.md` `/tmp/criteria.yaml` → `${TMPDIR:-/tmp}/criteria.yaml` per project artifact-location convention (consistently across all examples in the file); stale "18 + 1 = 19" component count → placeholders.
- `demos/data.md` "Asymmetric rule matching" → "Dependency-driven ordering based on Kahn's algorithm".
- `demos/README.md`, `docs/integrator/index.md`, `docs/README.md`: index/page tables gained missing rows.
STYLE / slug hygiene (per `.claude/CLAUDE.md` doc-style rules):
- `docs/user/cli-reference.md`: stripped trailing `(--flag)` parentheticals from three headings (Storage Class, Deployment Methods, Value Overrides). Deploy/Undeploy Script Behavior variants left for a coordinated follow-up; their slugs are referenced from ~23 places in bundle templates and golden-files.
- `docs/contributor/api-server.md`: three headings rewritten with "and" in place of `&` to fix double-hyphen slugs.
- `docs/integrator/aks-gpu-setup.md`: dropped trailing `(Important)` from a heading.
TYPO / NIT:
- "a automated" → "an automated" (`docs/README.md`).
- "end to end" → "end-to-end" (`docs/integrator/recipe-development.md`).
- "comma delimination" → "comma-separated" (`demos/cuj2.md`).
- Sentence fragment completed (`demos/valid.md`).
- `***Note:***` → `**Note:**` (`docs/user/component-catalog.md`).
- `github.com/nvidia/aicr` → `github.com/NVIDIA/aicr` in user docs.
- Literal version pins → placeholders (`<aicr-version>`, `<release-tag>`, etc.) in user-facing examples.
- `580.82.07` driver pin in examples README bumped to current `580.105.08`.
- `metadata.version: v0.26.7-next` → `dev` in hand-written example recipes (matches `aks-training.yaml` convention).
- `demos/examples/CUJ2-Test-Report.md`: top-of-file note flags the log as a pre-#871 ("chore(recipes): migrate kgateway -> agentgateway for v2.2 inference routing") historical capture.
New content (small additions where missing):
- Missing CNCF feature list in the `docs/user/cli-reference.md` `--feature` row.
- New feature-scoping section in `docs/user/validation.md`.
Review feedback addressed
Two rounds of review on this PR, both folded into the same squash:
CodeRabbit (round 1, 4 minor / quick-win):
- `demos/e2e.md`: criteria-file path now `${TMPDIR:-/tmp}/criteria.yaml` consistently (was mixed with literal `/tmp/`).
- `docs/conformance/cncf/index.md`: MD031 blank line added before fenced code block.
- `docs/contributor/api-server.md`: platform-allowlist wording changed to "valid platform enum values" (not "all platform values").
- `docs/contributor/cli.md`: snapshot measurement type renamed from `topology` to canonical `NodeTopology`.
Codex (round 2, three P2 + one P3):
- `docs/user/validation.md` `--feature` section: rewrote to describe CNCF-submission scoping (CLI rejects `--feature` without `--cncf-submission` + `--evidence-dir`), not conformance-phase scoping.
- `docs/contributor/validator.md`: NCCL performance-check Constraints column now shows variant-specific names + a callout explaining the `constraintNameForVariant` contract.
- `docs/contributor/cli.md` invalid-accelerator example: replaced with actual current CLI output (less wrapping; `exitCode=2`, not `8`). Verified by running `aicr recipe --accelerator invalid-gpu`.
- `examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml`: added matching `nccl-all-reduce-bw-net >= 40` and `nccl-all-reduce-bw-nvls >= 500` constraints so the variant checks actually run.
CodeRabbit (round 3, 2 minor):
- `demos/cuj1-eks.md`: added a clarifying paragraph noting the AICR reference clusters intentionally use distinct selector vs. toleration keys.
- `docs/contributor/cli.md` Mermaid flow: added `NodeTopologyCollector` factory step and `Go Routine 6: NodeTopology` so the diagram matches the snapshot output.
Skipped / out of scope
- ADRs (`docs/design/*`): frozen historical records. ADR-002 already has a 2026-05-14 implementation note for per-run RBAC; no rewrite.
- `demos/cuj{1,2}-{eks,gke}.md` podTemplateOverrides cleanup: both already explicitly note "No podTemplateOverrides / runtimePatches needed" since `torch-distributed` bakes scheduling at bundle time.
- `site/docs/*`: gitignored, auto-generated by Fern.
Testing
Doc-only PR; full `make qualify` skipped per `CLAUDE.local.md` ("doc-only / infra-only change ... cheap checks are enough"). `make lint` covers yamllint, gofmt, golangci-lint, license-check, sidebar check, agents-sync, and chart-pin verification; all green.
Spot-verified each finding against current source (`pkg/recipe/criteria.go`, `pkg/recipe/oskind/oskind.go`, `pkg/recipe/allowlist.go`, `pkg/api/server.go`, `pkg/collector/factory.go`, `pkg/evidence/cncf/collector.go`, `pkg/defaults/timeouts.go`, `pkg/errors/exitcode.go`, `validators/context.go`, `validators/performance/nccl_all_reduce_bw.go`, `recipes/registry.yaml`, `recipes/validators/catalog.yaml`, `recipes/overlays/gb200-eks-training.yaml`) before editing. The invalid-accelerator CLI example was verified by running the actual command against the latest binary.
Risk Assessment
`eks-training.yaml` (previously rejected). No CLI/API/code changes.
Rollout notes: None.
Checklist
- `make lint` passes locally
- `Co-Authored-By` lines
- `-S`)