docs: fix audit findings across docs, demos, examples #896

Merged: mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:fix/docs-audit-2026-05-14 on May 14, 2026

@yuanchen8911 yuanchen8911 commented May 14, 2026

Summary

Sweep of doc-quality issues uncovered by an audit of docs/, demos/, and examples/ against current main. 31 files changed (+208/-105) after two rounds of review feedback. Doc-only.

Motivation / Context

An audit run across docs/user/, docs/contributor/, docs/integrator/, docs/conformance/, docs/design/, demos/, and examples/ surfaced a mix of: example recipes that fail bundle generation, code drift in contributor docs (struct fields, interface methods, env vars, cache TTLs, endpoint lists), enum lists that omit values, stale version examples, slug-breaking heading shapes, and a handful of typos. Each finding was verified against current source before editing.

Fixes: N/A
Related: bundles on top of #888 (per-run RBAC), #889 (CNCF feature=all timeout), #871 (kgateway → agentgateway).

Type of Change

  • Documentation update

Component(s) Affected

  • Docs/examples (docs/, examples/)

Implementation Notes

ERROR (breaking or contradicts current code):

  • examples/recipes/eks-training.yaml + eks-gb200-ubuntu-training-with-validation.yaml: deploymentOrder listed kube-prometheus-stack with no componentRefs entry → bundle generation rejected. Added the entry.
  • eks-gb200-...with-validation.yaml: performance phase referenced nonexistent checks (nccl-bandwidth-test, fabric-health-check) and unknown infrastructure: nccl-doctor → replaced with the real nccl-all-reduce-bw-net / -nvls from the gb200-eks-training overlay, with matching exact-named constraints (>= 40 / >= 500) so the variant checks actually exercise instead of skipping. Bumped gpu-operator v25.3.3 → v26.3.1 and nvidia-dra-driver-gpu 25.8.1 → 25.12.0 to registry defaults; bumped the matching deployment-phase constraint and remediation text to keep the example self-consistent.
  • docs/contributor/validator.md: validators.Context struct field Recipe *recipe.RecipeResult (removed) → ValidationInput *v1.ValidationInput; gated-skip example updated. Added an RBAC subsection documenting per-run aicr-validator-<runID> naming (PR #888) and the app.kubernetes.io/name=aicr-validator label-selector pattern for cleanup tooling. NCCL performance-check table now shows the variant-specific constraint names (nccl-all-reduce-bw-net / -nvls) plus a callout on the constraintNameForVariant contract so recipe authors don't get silent Skips.
  • docs/contributor/api-server.md: Cache-Control: max-age=300 → 600 (matches defaults.RecipeCacheTTL = 10m); root endpoint routes list expanded from [/v1/recipe] to all three registered routes; Query parameter table gained the missing platform row; broken anchor index.md#cicd-architecture replaced with a working link; platform-allowlist wording clarified to "valid enum values" (not arbitrary strings).
  • docs/contributor/cli.md: Factory interface snippet went from 4 to 5 methods (added CreateNodeTopologyCollector); snapshot measurements list and Mermaid flow diagram both gained NodeTopology (canonical type name + corresponding collector + parallel goroutine); bogus [INTERNAL]-wrapped invalid-accelerator example corrected — now shows actual current output with [INVALID_REQUEST] wrapping and exitCode=2 (verified by running the CLI); duplicate log line and empty section removed.
  • docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT) to the validate exit-code table; --best-effort (a real flag) is no longer mis-used as the "rejected typo" example.
  • docs/user/api-reference.md: Bundle Components table gained the three missing components (nfd, slinky-slurm-operator, slinky-slurm-operator-crds) and was re-sorted alphabetically.
  • docs/user/validation.md: new --feature section now correctly describes the flag as scoping the CNCF-submission evidence collector (requires --cncf-submission + --evidence-dir), not the conformance-phase validator run.
  • docs/conformance/cncf/index.md: corrected script path to pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the directory tree; section-name list updated to match the script's case block; markdownlint MD031 blank-line fix around fenced code block.
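
The recipe defect in the first item above (a deploymentOrder entry with no matching componentRefs entry) can be pictured with a small consistency check. This is only an illustrative sketch in Python, not the project's Go implementation: the function name `missing_component_refs` is hypothetical, and the assumption that each componentRefs entry carries a `name` key is mine rather than something stated in this PR.

```python
def missing_component_refs(recipe: dict) -> list[str]:
    """Return deploymentOrder names that have no componentRefs entry.

    Assumes (hypothetically) that componentRefs is a list of mappings,
    each with a "name" key, and deploymentOrder is a list of names.
    """
    declared = {ref["name"] for ref in recipe.get("componentRefs", [])}
    return [name for name in recipe.get("deploymentOrder", []) if name not in declared]


# A recipe shaped like the broken example: the order mentions a component
# that was never declared.
broken_recipe = {
    "componentRefs": [{"name": "gpu-operator"}],
    "deploymentOrder": ["gpu-operator", "kube-prometheus-stack"],
}

# After the fix, the missing entry is declared and nothing is reported.
fixed_recipe = {
    "componentRefs": [{"name": "gpu-operator"}, {"name": "kube-prometheus-stack"}],
    "deploymentOrder": ["gpu-operator", "kube-prometheus-stack"],
}

print(missing_component_refs(broken_recipe))
print(missing_component_refs(fixed_recipe))
```

A check of this shape reports the undeclared component for the broken recipe and an empty list once the entry is added.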

INCONSISTENCY / drift:

  • platform enum was (kubeflow) → full list (dynamo, kubeflow, nim, slurm) in docs/README.md, docs/contributor/validations.md.
  • OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux, talos) in docs/integrator/data-flow.md.
  • Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in docs/integrator/{recipe-development,automation,data-flow}.md.
  • demos/cuj1-eks.md: --system-node-selector dedicated=system-workload (a taint key as a label selector) → nodeGroup=system-worker; matching prose fix; added a clarifying paragraph noting that the AICR reference clusters intentionally use distinct selector vs. toleration keys.
  • demos/cuj2.md: toleration grammar field [operation] → [effect].
  • demos/cuj2-demo.md gpu-operator pin updated; ASCII box borders re-aligned.
  • demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per project artifact-location convention (consistently across all examples in the file); stale "18 + 1 = 19" component count → placeholders.
  • demos/data.md "Asymmetric rule matching" → "Dependency-driven ordering based on Kahn's algorithm".
  • demos/README.md, docs/integrator/index.md, docs/README.md: index/page tables gained missing rows.
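
For readers unfamiliar with the term now used in demos/data.md, Kahn's algorithm produces a dependency-respecting order by repeatedly emitting nodes with no unmet dependencies. The following is a generic sketch in Python, not the project's Go code; the component dependencies in the example are made up for illustration.

```python
from collections import deque


def kahn_order(deps: dict[str, list[str]]) -> list[str]:
    """Topologically sort components; deps maps a component to what it requires."""
    nodes = set(deps) | {d for requires in deps.values() for d in requires}
    indegree = {n: 0 for n in nodes}
    dependents: dict[str, list[str]] = {n: [] for n in nodes}
    for node, requires in deps.items():
        for dep in requires:
            indegree[node] += 1
            dependents[dep].append(node)

    # Start with everything that has no unmet dependency (sorted for determinism).
    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for nxt in sorted(dependents[current]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Any node left unemitted is part of a cycle.
    if len(order) != len(nodes):
        raise ValueError("dependency cycle detected")
    return order


# Hypothetical dependencies, for illustration only.
example_deps = {
    "cert-manager": [],
    "gpu-operator": ["cert-manager"],
    "kube-prometheus-stack": ["cert-manager"],
}
print(kahn_order(example_deps))
```

The cycle check at the end is what distinguishes this from a plain breadth-first walk: a graph with a cycle never drains, so the function raises instead of returning a partial order.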

STYLE / slug hygiene (per .claude/CLAUDE.md doc-style rules):

  • docs/user/cli-reference.md: stripped trailing (--flag) parentheticals from three headings (Storage Class, Deployment Methods, Value Overrides). Deploy/Undeploy Script Behavior variants left for a coordinated follow-up — their slugs are referenced from ~23 places in bundle templates and golden-files.
  • docs/contributor/api-server.md: three headings rewritten with "and" in place of & to fix double-hyphen slugs.
  • docs/integrator/aks-gpu-setup.md: dropped trailing (Important) from a heading.
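
To see why these heading shapes break slugs, here is a simplified Python approximation of GitHub-style anchor generation (lowercase, drop punctuation, turn each space into a hyphen). The exact slugger used by the docs site may differ, and the "Caching" headings below are hypothetical examples, not headings taken from the repository.

```python
import re


def slugify(heading: str) -> str:
    """Approximate a GitHub-style heading anchor (simplified; an assumption)."""
    lowered = heading.strip().lower()
    # Drop everything except word characters, hyphens, and spaces.
    kept = re.sub(r"[^\w\- ]", "", lowered)
    # Each space becomes a hyphen, so a removed "&" leaves two in a row.
    return kept.replace(" ", "-")


print(slugify("Caching & Rate Limiting"))
print(slugify("Caching and Rate Limiting"))
print(slugify("Storage Class (--storage-class)"))
print(slugify("Storage Class"))
```

Under this approximation the "&" variant yields a double-hyphen slug and the trailing flag parenthetical yields a triple-hyphen one, while the rewritten headings produce clean anchors.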

TYPO / NIT:

  • "a automated" → "an automated" (docs/README.md).
  • "end to end" → "end-to-end" (docs/integrator/recipe-development.md).
  • "comma delimination" → "comma-separated" (demos/cuj2.md).
  • Sentence fragment completed (demos/valid.md).
  • ***Note:*** → **Note:** (docs/user/component-catalog.md).
  • github.com/nvidia/aicr → github.com/NVIDIA/aicr in user docs.
  • Hard-coded version strings replaced with placeholders (<aicr-version>, <release-tag>, etc.) in user-facing examples.
  • Stale 580.82.07 driver pin in examples README bumped to current 580.105.08.
  • metadata.version: v0.26.7-next → dev in hand-written example recipes (matches aks-training.yaml convention).
  • demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log as a pre-#871 historical capture (kgateway → agentgateway migration).

New content (small additions where missing):

  • 9-feature CNCF list added to docs/user/cli-reference.md --feature row.
  • "Scoping CNCF submission evidence to specific features" section added to docs/user/validation.md.

Review feedback addressed

Three rounds of review on this PR, all folded into the same squash:

CodeRabbit (round 1, 4 minor / quick-win):

  • demos/e2e.md: criteria-file path now ${TMPDIR:-/tmp}/criteria.yaml consistently (was mixed with literal /tmp/).
  • docs/conformance/cncf/index.md: MD031 blank line added before fenced code block.
  • docs/contributor/api-server.md: platform-allowlist wording reworded to "valid platform enum values" (not "all platform values").
  • docs/contributor/cli.md: snapshot measurement type renamed from topology to canonical NodeTopology.
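
For reference, the shell expansion `${TMPDIR:-/tmp}` falls back to /tmp when TMPDIR is unset or empty. A rough Python equivalent, shown only to make that semantics explicit (the helper name `criteria_path` is hypothetical):

```python
import os
import posixpath


def criteria_path(env=None) -> str:
    """Mirror "${TMPDIR:-/tmp}/criteria.yaml": use /tmp if TMPDIR is unset or empty."""
    env = os.environ if env is None else env
    base = env.get("TMPDIR") or "/tmp"
    return posixpath.join(base, "criteria.yaml")


print(criteria_path({}))
print(criteria_path({"TMPDIR": ""}))
print(criteria_path({"TMPDIR": "/var/tmp"}))
```

The `or` fallback matters: a plain `env.get("TMPDIR", "/tmp")` would keep an empty TMPDIR, which corresponds to the shell's `${TMPDIR-/tmp}` form rather than `${TMPDIR:-/tmp}`.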

Codex (round 2, three P2 + one P3):

  • docs/user/validation.md --feature section: rewrote to describe CNCF-submission scoping (CLI rejects --feature without --cncf-submission + --evidence-dir), not conformance-phase scoping.
  • docs/contributor/validator.md: NCCL performance-check Constraints column now shows variant-specific names + a callout explaining the constraintNameForVariant contract.
  • docs/contributor/cli.md invalid-accelerator example: replaced with actual current CLI output (less wrapping; exitCode=2, not 8). Verified by running aicr recipe --accelerator invalid-gpu.
  • examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: added matching nccl-all-reduce-bw-net >= 40 and nccl-all-reduce-bw-nvls >= 500 constraints so the variant checks actually run.
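
The "silent Skip" behavior behind the last two items can be pictured as an exact-name lookup: a variant check only runs when the recipe carries a constraint with the same name. The sketch below is a loose Python model of that contract, not the validator's Go code; the `evaluate` helper, the status strings, and the measured values are assumptions, while the constraint names and thresholds (>= 40, >= 500) come from this PR.

```python
def evaluate(check: str, constraints: dict[str, float], measured: float) -> str:
    """Run a check only if a constraint with exactly the same name exists."""
    threshold = constraints.get(check)
    if threshold is None:
        # No exact-name match: the check is skipped rather than failed.
        return "skipped"
    return "passed" if measured >= threshold else "failed"


# Constraints as added to the example recipe.
variant_constraints = {
    "nccl-all-reduce-bw-net": 40,
    "nccl-all-reduce-bw-nvls": 500,
}

# A recipe that only names a generic (hypothetical) constraint matches neither variant.
generic_constraints = {"nccl-all-reduce-bw": 500}

# Measured values here are invented for illustration.
print(evaluate("nccl-all-reduce-bw-net", variant_constraints, 45.0))
print(evaluate("nccl-all-reduce-bw-nvls", variant_constraints, 480.0))
print(evaluate("nccl-all-reduce-bw-nvls", generic_constraints, 600.0))
```

The third call shows the failure mode the callout warns about: a healthy measurement is never evaluated because no constraint carries the variant's exact name.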

CodeRabbit (round 3, 2 minor):

  • demos/cuj1-eks.md: added a clarifying paragraph noting the AICR reference clusters intentionally use distinct selector vs. toleration keys.
  • docs/contributor/cli.md Mermaid flow: added NodeTopologyCollector factory step and Go Routine 6: NodeTopology so the diagram matches the snapshot output.

Skipped / out of scope

  • ADRs (docs/design/*): frozen historical records. ADR-002 already has a 2026-05-14 implementation note for per-run RBAC; no rewrite.
  • Deploy/Undeploy Script Behavior heading parentheticals: ~23 inbound anchor links across bundle templates and golden-files; coordinated PR.
  • demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup: both already explicitly note "No podTemplateOverrides / runtimePatches needed" since torch-distributed bakes scheduling at bundle time.
  • site/docs/*: gitignored, auto-generated by Fern.

Testing

make lint

Doc-only PR; full make qualify skipped per CLAUDE.local.md ("doc-only / infra-only change ... cheap checks are enough"). make lint covers yamllint, gofmt, golangci-lint, license-check, sidebar check, agents-sync, and chart-pin verification — all green.

Spot-verified each finding against current source (pkg/recipe/criteria.go, pkg/recipe/oskind/oskind.go, pkg/recipe/allowlist.go, pkg/api/server.go, pkg/collector/factory.go, pkg/evidence/cncf/collector.go, pkg/defaults/timeouts.go, pkg/errors/exitcode.go, validators/context.go, validators/performance/nccl_all_reduce_bw.go, recipes/registry.yaml, recipes/validators/catalog.yaml, recipes/overlays/gb200-eks-training.yaml) before editing. The invalid-accelerator CLI example was verified by running the actual command against the latest binary.

Risk Assessment

  • Low — Doc-only. The only YAML changes are to example recipes; bundle generation is now possible against eks-training.yaml (previously rejected). No CLI/API/code changes.

Rollout notes: None.

Checklist

  • make lint passes locally
  • No Co-Authored-By lines
  • Cryptographically signed (-S)


coderabbitai Bot commented May 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR updates documentation and example recipes across the repository: expands supported platform enum values, updates API/doc references and Cache-Control examples, adds NodeTopology to CLI snapshot docs and a factory method, documents validator context and per-run RBAC naming, bumps GPU operator and related chart versions in examples, adds kube-prometheus-stack to recipes, adjusts demos (placeholders, TMPDIR, toleration/selector text), standardizes GitHub/version placeholders, and reorganizes CNCF evidence collector paths.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/aicr#895: Updates GB200/EKS recipe validation and gpu-operator version constraints related to example recipe changes.

Suggested labels

size/S

Suggested reviewers

  • mchmarny
  • pdmack
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title clearly summarizes the main change as a documentation audit fix sweep across docs/, demos/, and examples/, which directly reflects the extensive documentation updates throughout the changeset.
Description check | ✅ Passed | The description comprehensively explains the audit findings, the specific fixes applied across all affected files, the implementation notes, and the testing approach, clearly relating to the documentation changes in the changeset.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@demos/e2e.md`:
- Line 51: The example uses a variable-backed criteria path in one command but a
hardcoded /tmp path later; update the later command that references
/tmp/criteria.yaml to use the same "${TMPDIR:-/tmp}/criteria.yaml" form so the
criteria-file path is consistent across the flow (i.e., replace
/tmp/criteria.yaml with "${TMPDIR:-/tmp}/criteria.yaml" in the subsequent aicr
command/example).

In `@docs/conformance/cncf/index.md`:
- Around line 84-90: The fenced code block that starts with ```bash immediately
follows the paragraph "Alternatively, run the evidence collection script
directly." which triggers markdownlint MD031; fix it by inserting a blank line
before the fenced code block (and keep the existing blank line after) so the
block is separated from the preceding paragraph—edit the fenced code block in
the cncf index content (the triple-backtick fence containing the
collect-evidence.sh examples) to ensure a blank line precedes it.

In `@docs/contributor/api-server.md`:
- Line 1076: The sentence saying "The `platform` criteria field has no allowlist
env var today; all platform values are accepted" is misleading because enum
validation prevents arbitrary strings; update the documentation text for the
`platform` criteria field to state that there is currently no environment-based
allowlist and that only the predefined enum values (the allowed platform enum)
are accepted—clarify the expected failure mode when an invalid enum value is
provided (validation error) and remove wording that implies arbitrary strings
are permitted.

In `@docs/contributor/cli.md`:
- Line 307: The snapshot example uses the non-canonical measurement name
"topology"; update the example string so the measurements array uses the
canonical type name "NodeTopology" instead of "topology" (the example line
containing E --> F["Snapshot Structure...measurements: [k8s, systemd, os, gpu,
topology]"] should be changed to use NodeTopology) so docs match the code-facing
contract.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a9215fcd-3b87-451a-9479-fb6abe4400a2

📥 Commits

Reviewing files that changed from the base of the PR and between 1f3e1c2 and 2295c97.

📒 Files selected for processing (31)
  • README.md
  • demos/README.md
  • demos/cuj1-eks.md
  • demos/cuj2-demo.md
  • demos/cuj2.md
  • demos/data.md
  • demos/e2e.md
  • demos/examples/CUJ2-Test-Report.md
  • demos/valid.md
  • docs/README.md
  • docs/conformance/cncf/index.md
  • docs/contributor/api-server.md
  • docs/contributor/cli.md
  • docs/contributor/index.md
  • docs/contributor/validations.md
  • docs/contributor/validator.md
  • docs/integrator/aks-gpu-setup.md
  • docs/integrator/automation.md
  • docs/integrator/data-flow.md
  • docs/integrator/index.md
  • docs/integrator/recipe-development.md
  • docs/user/agent-deployment.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • docs/user/component-catalog.md
  • docs/user/installation.md
  • docs/user/validation.md
  • examples/recipes/README.md
  • examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml
  • examples/recipes/eks-training.yaml
  • examples/recipes/kind.yaml
💤 Files with no reviewable changes (1)
  • README.md

@yuanchen8911 yuanchen8911 force-pushed the fix/docs-audit-2026-05-14 branch from 2295c97 to 3d8cefa Compare May 14, 2026 20:35

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@demos/cuj1-eks.md`:
- Around line 50-52: The selector/toleration values are mismatched:
--system-node-selector uses nodeGroup=system-worker while the
--system-node-toleration entries reference dedicated=system-workload; either
make this explicit or align them. Fix by changing the toleration keys/values to
match the selector (e.g., use dedicated=system-worker) or update the selector to
nodeGroup=system-workload, or add a short comment clarifying the intentional
difference; update the occurrences of --system-node-selector and
--system-node-toleration so the selector label and taint key/value correspond.

In `@docs/contributor/cli.md`:
- Line 307: The diagram's "Detailed Data Flow" block is missing the NodeTopology
collection step and thus is inconsistent with the measurements list; update the
flow diagram to add a sixth collection goroutine for NodeTopology (so the
collection goroutines show metadata, Kubernetes, SystemD, OS, GPU, NodeTopology)
and add the NodeTopology label to the relevant diagram node(s) (the node that
currently lists measurements and the block that enumerates the collection
goroutines) so the diagram matches the measurements array and the narrative;
ensure any references to five collectors (e.g., the goroutine list) are changed
to include NodeTopology.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6d43b174-94bd-4231-95ad-6261687eeb79

📥 Commits

Reviewing files that changed from the base of the PR and between 2295c97 and 3d8cefa.

mchmarny previously approved these changes May 14, 2026

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
docs/contributor/cli.md (1)

299-305: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the Detailed Data Flow diagram to include NodeTopology collection goroutine.

The snapshot measurements list (lines 307, 315) now includes NodeTopology, but the "Detailed Data Flow" diagram still shows only five collection goroutines (Metadata, Kubernetes, SystemD, OS, GPU). Add a sixth goroutine node for NodeTopology collection to keep the diagram consistent with the documented measurements.

As per coding guidelines: "Document system design, architecture decisions, and component interactions in contributor documentation."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/contributor/cli.md` around lines 299 - 305, The diagram is missing the
NodeTopology collection goroutine node; add a sixth node (e.g., D6) labeled like
"Go Routine 6: NodeTopology<br/>• NodeTopology" (or similar descriptive bullet
items) and update the final merge arrow so D1, D2, D3, D4, D5, and D6 all point
to E["All goroutines complete<br/>or first error returns"] to keep the "Detailed
Data Flow" diagram consistent with the snapshot measurements that include
NodeTopology.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c48ad5f1-f110-4548-a2f2-f967125eefa5

📥 Commits

Reviewing files that changed from the base of the PR and between 3d8cefa and 22cfc2d.

@yuanchen8911 yuanchen8911 force-pushed the fix/docs-audit-2026-05-14 branch from 22cfc2d to d3d4da3 Compare May 14, 2026 20:45
@mchmarny mchmarny enabled auto-merge (squash) May 14, 2026 20:47
mchmarny previously approved these changes May 14, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/contributor/cli.md`:
- Line 309: Update the Snapshot Structure example's measurements list to use the
canonical measurement type names; locate the "Snapshot Structure" line (the
diagram node labeled E --> F["Snapshot Structure..."] showing measurements:
[k8s, systemd, os, gpu, NodeTopology]) and change the entries to the documented
canonical casing: [K8s, SystemD, OS, GPU, NodeTopology] so examples match the
documented types.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: be941cb9-2f1f-4b43-87f3-68496e799204

📥 Commits

Reviewing files that changed from the base of the PR and between 22cfc2d and d3d4da3.

@github-actions

@yuanchen8911 this PR now has merge conflicts with main. Please rebase to resolve them.

Sweep of doc-quality issues uncovered by an audit of docs/, demos/,
and examples/ against current main. Fixes break into four buckets:

ERROR (breaking or contradicts code):
- examples/recipes/eks-training.yaml: deploymentOrder listed
  kube-prometheus-stack with no matching componentRefs entry — bundle
  generation rejected the recipe. Added the missing entry.
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same
  defect; performance phase referenced nonexistent checks
  (nccl-bandwidth-test, fabric-health-check) and an unknown
  `infrastructure: nccl-doctor` line — replaced with the real
  nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training
  overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and
  nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults.
- docs/contributor/validator.md: validators.Context struct field
  `Recipe *recipe.RecipeResult` (no longer exists) → `ValidationInput
  *v1.ValidationInput`; gated skip example updated to match. Added an
  RBAC subsection documenting per-run aicr-validator-<runID> naming
  introduced in PR NVIDIA#888.
- docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to
  match RecipeCacheTTL (10 min); root endpoint `routes` list expanded
  from [/v1/recipe] to all three registered routes; Query parameter
  table gained the missing `platform` row; broken anchor
  index.md#cicd-architecture replaced with a working link.
- docs/contributor/cli.md: Factory interface snippet went from 4 to 5
  methods (CreateNodeTopologyCollector was missing); snapshot
  measurements list gained `topology`; bogus [INTERNAL]-wrapped
  invalid-accelerator example corrected to [INVALID_REQUEST];
  duplicate log line and empty section removed.
- docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT)
  to the validate exit-code table; `--best-effort` (a real flag) is no
  longer used as the "rejected typo" example.
- docs/user/api-reference.md: Bundle Components table gained the three
  missing components (nfd, slinky-slurm-operator,
  slinky-slurm-operator-crds) and was re-sorted alphabetically.
- docs/conformance/cncf/index.md: corrected script path
  pkg/evidence/scripts/collect-evidence.sh →
  pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the
  directory tree; section-name list updated to match the script's
  case block.

INCONSISTENCY / drift:
- platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm)
  in docs/README.md, docs/contributor/validations.md.
- OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux,
  talos) in docs/integrator/data-flow.md.
- Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in
  docs/integrator/{recipe-development,automation,data-flow}.md.
- demos/cuj1-eks.md: --system-node-selector dedicated=system-workload
  (taint key as a label selector) → nodeGroup=system-worker; matching
  prose adjustment.
- demos/cuj2.md: toleration grammar field [operation] → [effect].
- demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box
  borders re-aligned.
- demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per
  project artifact-location convention; stale "18 + 1 = 19" component
  count replaced with placeholders.
- demos/data.md: "Asymmetric rule matching" → "Dependency-driven
  ordering based on Kahn's algorithm".
- demos/README.md: index table gained the missing CUJ pages.
- docs/integrator/index.md and docs/README.md: integrator-pages tables
  gained missing rows.

STYLE / slug hygiene (CLAUDE.md doc-style rules):
- docs/user/cli-reference.md: stripped trailing (--flag)
  parentheticals from three headings (Storage Class, Deployment
  Methods, Value Overrides) — the Deploy/Undeploy variants are left
  alone in this PR because their slugs have many inbound references in
  bundle templates and golden-files which would need a coordinated
  follow-up.
- docs/contributor/api-server.md: three headings rewritten with
  "and" in place of "&" to fix double-hyphen slugs.
- docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)"
  from a heading.

TYPO / NIT:
- docs/README.md "a automated" → "an automated".
- docs/integrator/recipe-development.md "end to end" → "end-to-end".
- demos/cuj2.md "comma delimination" → "comma-separated".
- demos/valid.md sentence fragment completed.
- docs/user/component-catalog.md `***Note:***` → `**Note:**`.
- docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr
  → github.com/NVIDIA/aicr.
- docs/user/agent-deployment.md and docs/user/cli-reference.md
  literal `app.kubernetes.io/version: v0.17.0` → <aicr-version>
  placeholder.
- docs/contributor/cli.md CronJob example pin v0.6.4 →
  <release-tag> placeholder.
- examples/recipes/README.md driver version 580.82.07 → 580.105.08 to
  match current gpu-operator pin.
- examples/recipes/{eks-training,eks-gb200-...,kind}.yaml
  metadata.version v0.26.7-next → dev (hand-written-example
  convention shared with aks-training.yaml).
- demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log
  as a pre-NVIDIA#871 historical capture (kgateway → agentgateway
  migration).

Also adds the missing CNCF feature list to docs/user/cli-reference.md
--feature row and a "Scoping conformance to specific features" section
in docs/user/validation.md, since the 9 ValidFeatures names from
pkg/evidence/cncf/collector.go were not documented for users.

Skipped (out of scope or risk-deferred):
- ADR-002 cross-references to per-run RBAC names beyond the existing
  2026-05-14 implementation note — ADRs are frozen historical records.
- Deploy Script Behavior / Undeploy Script Behavior heading
  parentheticals — ~23 inbound anchor links in bundle templates and
  golden-files would need a coordinated PR.
- demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup — both
  files already explicitly note "No podTemplateOverrides /
  runtimePatches needed" since the torch-distributed runtime bakes
  scheduling at bundle time.
- pre-existing site/docs/* renders (gitignored, auto-generated).

Doc-only PR; full `make qualify` skipped per CLAUDE.local.md
("doc-only / infra-only change ... cheap checks are enough"). Ran
`make lint` (yamllint, gofmt, license-check, sidebar check, agents
sync, chart-pin verification) — all green.
@yuanchen8911 yuanchen8911 force-pushed the fix/docs-audit-2026-05-14 branch from 3e0a231 to 50e6eb3 Compare May 14, 2026 20:58
@yuanchen8911 yuanchen8911 requested a review from mchmarny May 14, 2026 21:03
@mchmarny mchmarny merged commit 1919e20 into NVIDIA:main May 14, 2026
32 checks passed
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request May 15, 2026
…g note

Follow-up to PR NVIDIA#866 (Slinky slurm-operator). Backfills platform-enum
drift on surfaces the original PR did not catch, plus surfaces the
chart v1.1.0 nodeSelector silent-ignore limitation on the user-facing
component-catalog row.

PR NVIDIA#896 (yuanchen8911) already fixed docs/README.md and
docs/contributor/validations.md as part of a broader docs audit. PR
NVIDIA#893 removed the site/docs/ vitepress mirror entirely. Remaining
surfaces this PR addresses:

- docs/contributor/data.md:130 — platform row in criteria table
- docs/user/component-catalog.md:36 — slinky-slurm-operator row,
  appended **Known limitation:** clause documenting chart v1.1.0
  silent-ignore of operator.nodeSelector/webhook.nodeSelector
  (the same limitation is already inline in recipes/registry.yaml;
  this surfaces it on the rendered user catalog)
- pkg/api/doc.go:72 — REST API godoc was missing slurm entirely
- pkg/recipe/doc.go:32 — struct shape comment reordered to alphabetical
  order to match GetCriteriaPlatformTypes()
- pkg/recipe/doc.go:93-98 — CriteriaPlatform* constants reordered
  alphabetically
- pkg/recipe/criteria.go:246 — Platform-field godoc updated
- .claude/skills/analyzing-snapshots/SKILL.md:278 — internal AICR
  snapshot-analysis skill criteria table

All now list 'dynamo, kubeflow, nim, slurm' in alphabetical order, matching
pkg/recipe.GetCriteriaPlatformTypes() (the authoritative Go enum).

New guard test
==============

pkg/recipe/doc_test.go::TestCriteriaPlatformConstantsMatchGetter
asserts the CriteriaPlatform* constants and GetCriteriaPlatformTypes()
stay in sync. Mechanically catches the exact class of drift this
commit fixes if a future platform value is added to one but not the
other.
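
The idea behind such a guard can be sketched as a set comparison between the declared constants and the getter's return value. This is an illustration only, not the contents of pkg/recipe/doc_test.go: the lowercase constant names, `getPlatformTypes`, and `checkInSync` are hypothetical stand-ins, and only the four platform values come from this commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical stand-ins for the CriteriaPlatform* constants.
const (
	platformDynamo   = "dynamo"
	platformKubeflow = "kubeflow"
	platformNIM      = "nim"
	platformSlurm    = "slurm"
)

// getPlatformTypes is a hypothetical stand-in for GetCriteriaPlatformTypes().
func getPlatformTypes() []string {
	return []string{"dynamo", "kubeflow", "nim", "slurm"}
}

// missing returns the values present in a but absent from b, sorted.
func missing(a, b []string) []string {
	seen := make(map[string]bool, len(b))
	for _, v := range b {
		seen[v] = true
	}
	var out []string
	for _, v := range a {
		if !seen[v] {
			out = append(out, v)
		}
	}
	sort.Strings(out)
	return out
}

// checkInSync reports drift in either direction between the two lists.
func checkInSync(constants, getter []string) error {
	if m := missing(constants, getter); len(m) > 0 {
		return fmt.Errorf("constants missing from getter: %v", m)
	}
	if m := missing(getter, constants); len(m) > 0 {
		return fmt.Errorf("getter values without a constant: %v", m)
	}
	return nil
}

func main() {
	constants := []string{platformDynamo, platformKubeflow, platformNIM, platformSlurm}
	fmt.Println(checkInSync(constants, getPlatformTypes())) // prints "<nil>"

	// Simulate the drift: a platform added as a constant but not to the getter.
	fmt.Println(checkInSync(constants, []string{"dynamo", "kubeflow", "nim"}))
}
```

In a real test file the same comparison would typically live in a `func Test...(t *testing.T)` and call `t.Errorf` instead of returning an error.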

Addressed reviews
=================

@mchmarny (NVIDIA#866) — Doc-audit gap. cli-reference and api-reference were
fixed at merge time and NVIDIA#896 picked up README/validations; this PR
catches the remaining surfaces (contributor/data, the three Go godoc
files, and the internal skill table).

@coderabbitai (NVIDIA#866) — NodeSelector limitation note on the
slinky-slurm-operator catalog row.

Internal PE + QA + DA panel review on draft NVIDIA#884 — extended audit
to pkg/api/doc.go (factual miss), pkg/recipe/doc.go (ordering),
.claude/skills/analyzing-snapshots/SKILL.md, plus the guard test
in pkg/recipe/doc_test.go.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>