docs(reference): clarify accelerator / Capacity Block / clique semantics #289
Greptile Summary: This docs-only PR refines the semantic descriptions of three label entries in `docs/reference/node-labels.md`. Confidence Score: 5/5 — safe to merge; a docs-only change with no runtime impact. All findings are P2 style nits (an inconsistent NVML constant name, ephemeral-release phrasing). The semantic content is well sourced, the provider-dependency clarifications are accurate per the PR description and the cited AWS API references, and no code paths are touched. No files require special attention.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
  A[Scheduler needs topology label] --> B{Kubernetes cluster?}
  B -- No --> C[Read Slurm topology.conf directly]
  B -- Yes --> D{MNNVL hardware?}
  D -- No --> E[Use network.topology.nvidia.com/accelerator\nInfiniBand provider only source]
  D -- Yes --> F{Fabric Manager init\ncompleted?\nNVML_GPU_FABRIC_STATE_COMPLETED}
  F -- No --> G[nvidia.com/gpu.clique absent\nUse network.topology.nvidia.com/accelerator]
  F -- Yes --> H{Which provider?}
  H -- AWS --> I{Granularity needed?}
  H -- DRA / InfiniBand / Lambda AI --> J[Both labels carry same value\nnvidia.com/gpu.clique or accelerator]
  I -- NVL Domain --> K[Use network.topology.nvidia.com/accelerator\nCapacityBlockId = NVL Domain]
  I -- NVL Partition --> L[Use nvidia.com/gpu.clique\nCliqueID = finer sub-domain]
  J --> M[Need switch-level locality?]
  K --> M
  L --> M
  E --> M
  G --> M
  M -- Yes --> N[Add leaf / spine / core labels\nfrom InfiniBand or NetQ provider]
  M -- No --> O[Done]
  N --> O
```
Refines entries in `docs/reference/node-labels.md` and `docs/engines/k8s.md` to reflect the distinct underlying concepts, and the semantic mismatch that exists between topograph providers:

1. `network.topology.nvidia.com/accelerator`: renamed from "NVLink domain (clique) ID" to "Accelerated interconnect domain identifier"; notes that exact semantics are provider-dependent. MNNVL-aware providers (DRA, InfiniBand, Lambda AI) write an NVL Partition identifier (`<ClusterUUID>.<CliqueID>`, Fabric-Manager-derived). The AWS provider writes AWS's CapacityBlockId, a reservation-scoped value co-extensive with an UltraServer (i.e., the NVL Domain) on P6e-GB200.
2. `topology.k8s.aws/capacity-block-id`: replaces the prior "AWS Capacity Block (NVLink domain)" framing with citation-backed language. Per AWS's EC2 API reference for `InstanceTopology`, on UltraServer instances this "identifies instances within the UltraServer domain", making it NVL Domain-scoped. AWS's explicit NVL Domain label, `topology.k8s.aws/ultraserver-id`, is applied on SageMaker HyperPod-managed EKS; on plain EKS or self-managed Kubernetes, the NVL Domain must be derived from `nvidia.com/gpu.clique` (its ClusterUUID prefix encodes the domain).
3. `nvidia.com/gpu.clique`: clarifies that the value is an NVL Partition identifier formatted `<ClusterUUID>.<CliqueID>`, with ClusterUUID identifying the physical NVL Domain and CliqueID identifying a Fabric-Manager-assigned logical sub-domain. Documents the `NVML_GPU_FABRIC_STATE_COMPLETED` precondition and the "multiple cliques per NVL Domain" case (e.g., an x72 UltraServer split into two x36 halves).

Also introduces NVIDIA Fabric Manager and NVML once in the relevant paragraph (rather than using `GPU_FABRIC_STATE_COMPLETED` in passing without explanation), and adds a "Choosing between `accelerator` and `nvidia.com/gpu.clique` for scheduling" subsection with a tier-by-tier breakdown of locality levels not covered by the clique label, plus caveats on refresh cadence (60s GFD interval) and `FABRIC_MODE_RESTART` persistence. The parallel mention in `docs/engines/k8s.md` is converted to a cross-reference to avoid duplicating the Fabric-Manager explanation.

Finally, the "Integration with NVSentinel" section is updated to reflect that NVSentinel/#1226 (merged 2026-04-23) added the four `network.topology.nvidia.com/*` labels to the Metadata Augmentor's default `allowedLabels`. Operators on earlier releases retain the explicit opt-in snippet; a cross-link to NVSentinel's docs/INTEGRATIONS.md is added for the downstream-consumer perspective.

Uses "NVL Domain" / "NVL Partition" terminology throughout to distinguish the physical rack-scale MNNVL domain from the logical sub-domain Fabric Manager assigns within it.

Signed-off-by: Rob Esker <resker@nvidia.com>
PR #268 removed the `branches: [main]` push filter from `publish-fern-docs.yml` while keeping the `tags: [docs/v*]` filter. With GitHub Actions, defining `tags:` without a corresponding `branches:` restricts push events to tag refs only: branch pushes (including main) no longer trigger the workflow even when their changed paths match the `paths:` filter.

Symptom: the live Fern site at https://topograph.docs.buildwithfern.com/topograph has not been republished since the manual `workflow_dispatch` on 2026-04-20T16:51Z. PRs #284, #289, #290, #291, and #292 all touched docs/ but produced zero workflow runs. The Reference section restored by #284 (and the clique-semantics clarifications added by #289) are on main but invisible on the published site.

Restore `branches: [main]` so push events to main with docs/ or fern/ changes resume triggering publishes. Tag pushes for `docs/v*` and manual `workflow_dispatch` continue to work unchanged.

To clear the backlog after this PR merges, dispatch the workflow manually one time: `gh workflow run publish-fern-docs.yml --repo NVIDIA/topograph --ref main`

Signed-off-by: Rob Esker <resker@nvidia.com>
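The restored trigger would look roughly like this; a sketch only — the `paths` globs are illustrative assumptions, and the actual workflow file is authoritative:

```yaml
# Sketch of the intended publish-fern-docs.yml trigger (paths are illustrative).
on:
  push:
    branches: [main]     # restored: without it, tags: alone limits push events to tag refs
    tags: ['docs/v*']
    paths: ['docs/**', 'fern/**']
  workflow_dispatch: {}
```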
Description
Refines the semantic description of three entries in `docs/reference/node-labels.md` to reflect the distinct underlying concepts — and the semantic mismatch that exists between topograph providers — with citation-backed language.
Surfaced in NVIDIA/NVSentinel#1205 while discussing how NVSentinel's Metadata Augmentor should treat Topograph's labels vs existing AWS NFD labels. Additional empirical data from an NVIDIA-internal cluster showed that multiple distinct `nvidia.com/gpu.clique` values can share the same `topology.k8s.aws/capacity-block-id`, and cliques can be absent when Fabric Manager initialization has not completed — invalidating the prior doc's implicit 1:1 framing.
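The many-to-one relationship observed empirically can be illustrated with a small grouping sketch; the node names and label values below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical node -> labels mapping showing the observed shape:
# several distinct clique values under one capacity-block-id, and one
# node where Fabric Manager init has not completed (clique label absent).
nodes = {
    "node-a": {"topology.k8s.aws/capacity-block-id": "cb-1",
               "nvidia.com/gpu.clique": "uuid-1.0"},
    "node-b": {"topology.k8s.aws/capacity-block-id": "cb-1",
               "nvidia.com/gpu.clique": "uuid-1.1"},
    "node-c": {"topology.k8s.aws/capacity-block-id": "cb-1"},  # clique absent
}

cliques_per_block = defaultdict(set)
for name, labels in nodes.items():
    block = labels["topology.k8s.aws/capacity-block-id"]
    if "nvidia.com/gpu.clique" in labels:
        cliques_per_block[block].add(labels["nvidia.com/gpu.clique"])

# cb-1 maps to two distinct cliques: the relationship is not 1:1.
```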
Changes
1. `network.topology.nvidia.com/accelerator` description
The label was previously described as "NVLink domain (clique) ID". That's accurate for providers that derive their accelerator value from Fabric Manager (DRA, InfiniBand, Lambda AI), but not for the AWS provider, which derives its accelerator value from the AWS Capacity Block reservation ID (`pkg/providers/aws/instance_topology.go#L110-L111`) — a reservation-scoped identifier, not an NVLink-partition identifier.
New description names the label "Accelerated interconnect domain identifier" and flags that exact semantics are provider-dependent, pointing at the provider matrix below the table.
2. `topology.k8s.aws/capacity-block-id` description
Replaces the prior "AWS Capacity Block (NVLink domain)" framing. Per the AWS EC2 API reference for `InstanceTopology`, on UltraServer instances the field "identifies instances within the UltraServer domain" — a reservation-scoped grouping. On P6e-GB200 it is co-extensive with one UltraServer (AWS requires reserving the UltraServer as a unit per the EKS UltraServer guide), but AWS's explicit "same NVLink domain" label is `topology.k8s.aws/ultraserver-id`, not `capacity-block-id`. The finer-grained NVLink partition — potentially multiple per UltraServer — is surfaced by the NVIDIA GPU Operator as `nvidia.com/gpu.clique`.
3. `nvidia.com/gpu.clique` description
Adds the Fabric-Manager-completion precondition (`NVML_GPU_FABRIC_STATE_COMPLETED`), notes that the label can be absent on MNNVL nodes where Fabric Manager init has not completed, and clarifies that a clique is a logical sub-domain of an MNNVL domain — so multiple clique values can appear within a single NVLink domain (e.g., an x72 UltraServer split into two x36 halves).
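Since the value is formatted `<ClusterUUID>.<CliqueID>`, the physical NVL Domain can be recovered from the prefix. A minimal parsing sketch; the helper name and sample values are ours, not part of any NVIDIA API:

```python
def split_clique(value: str) -> tuple[str, str]:
    """Split a nvidia.com/gpu.clique value into (ClusterUUID, CliqueID).

    The value is formatted <ClusterUUID>.<CliqueID>; a UUID contains no
    dots, so splitting on the last '.' is safe. ClusterUUID identifies
    the physical NVL Domain; CliqueID the Fabric-Manager-assigned
    logical sub-domain.
    """
    cluster_uuid, clique_id = value.rsplit(".", 1)
    return cluster_uuid, clique_id

# Hypothetical values: two x36 halves of one x72 UltraServer share the
# ClusterUUID (same NVL Domain) but differ in CliqueID (different partitions).
a = split_clique("5f0a8e2e-1d2c-4c1b-9d3a-000000000001.0")
b = split_clique("5f0a8e2e-1d2c-4c1b-9d3a-000000000001.1")
```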
Follow-up (not in this PR)
The semantic mismatch between topograph's AWS provider (writes reservation-scoped CapacityBlockId) and the other MNNVL-aware providers (write Fabric-Manager clique IDs) will be tracked as a separate design issue. Concrete options there include: (a) AWS provider prefers `topology.k8s.aws/ultraserver-id` (when present) or `nvidia.com/gpu.clique` (when Fabric Manager has completed) over `CapacityBlockId`; (b) `accelerator` is split into two labels with different granularity; (c) left as-is and the doc's provider-dependent note suffices. Out of scope here; this PR is doc-only.