
docs(reference): clarify accelerator / Capacity Block / clique semantics #289

Merged
dmitsh merged 1 commit into NVIDIA:main from resker:docs/aws-capacity-block-clarification
Apr 24, 2026

Conversation

@resker (Collaborator) commented Apr 23, 2026

Description

Refines the semantic description of three entries in `docs/reference/node-labels.md` to reflect the distinct underlying concepts — and the semantic mismatch that exists between topograph providers — with citation-backed language.

Surfaced in NVIDIA/NVSentinel#1205 while discussing how NVSentinel's Metadata Augmentor should treat Topograph's labels vs existing AWS NFD labels. Additional empirical data from an NVIDIA-internal cluster showed that multiple distinct `nvidia.com/gpu.clique` values can share the same `topology.k8s.aws/capacity-block-id`, and cliques can be absent when Fabric Manager initialization has not completed — invalidating the prior doc's implicit 1:1 framing.

Changes

1. `network.topology.nvidia.com/accelerator` description

The label was previously described as "NVLink domain (clique) ID". That's accurate for providers that derive their accelerator value from Fabric Manager (DRA, InfiniBand, Lambda AI), but not for the AWS provider, which derives its accelerator value from the AWS Capacity Block reservation ID (`pkg/providers/aws/instance_topology.go#L110-L111`) — a reservation-scoped identifier, not an NVLink-partition identifier.

New description names the label "Accelerated interconnect domain identifier" and flags that exact semantics are provider-dependent, pointing at the provider matrix below the table.
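As an illustration of the provider matrix (provider names here follow the PR description, not any actual topograph API), a downstream consumer could encode "is this label partition-scoped?" like this:

```python
# Illustrative only: the provider set below is taken from the PR text,
# not from topograph's code.
PARTITION_SCOPED = {"dra", "infiniband", "lambda-ai"}  # Fabric-Manager-derived

def accelerator_is_nvl_partition(provider: str) -> bool:
    """True when network.topology.nvidia.com/accelerator carries an NVL
    Partition (clique) ID; False for AWS, where it carries the
    reservation-scoped CapacityBlockId."""
    return provider.lower() in PARTITION_SCOPED

print(accelerator_is_nvl_partition("aws"))  # False: reservation-scoped on AWS
```

The point of the sketch is that the same label key requires per-provider interpretation, which is exactly what the doc's provider matrix now makes explicit.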

2. `topology.k8s.aws/capacity-block-id` description

Replaces the prior "AWS Capacity Block (NVLink domain)" framing. Per the AWS EC2 API reference for `InstanceTopology`, on UltraServer instances the field "identifies instances within the UltraServer domain" — a reservation-scoped grouping. On P6e-GB200 it is co-extensive with one UltraServer (AWS requires reserving the UltraServer as a unit per the EKS UltraServer guide), but AWS's explicit "same NVLink domain" label is `topology.k8s.aws/ultraserver-id`, not `capacity-block-id`. The finer-grained NVLink partition — potentially multiple per UltraServer — is surfaced by the NVIDIA GPU Operator as `nvidia.com/gpu.clique`.
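For concreteness, the three labels on a single hypothetical P6e-GB200 node might look as follows. All values are invented for illustration; only the label keys come from the text above:

```yaml
# Hypothetical node labels (values invented):
topology.k8s.aws/capacity-block-id: cr-0123456789abcdef0  # reservation-scoped grouping
topology.k8s.aws/ultraserver-id: us-0fedcba9876543210     # AWS's explicit NVLink-domain label
nvidia.com/gpu.clique: 11111111-aaaa.32                   # finer NVL partition (GPU Operator)
```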

3. `nvidia.com/gpu.clique` description

Adds the Fabric-Manager-completion precondition (`NVML_GPU_FABRIC_STATE_COMPLETED`), notes that the label can be absent on MNNVL nodes where Fabric Manager initialization has not completed, and clarifies that a clique is a logical sub-domain of an MNNVL NVLink domain, so multiple clique values can appear within a single NVLink domain (e.g., an x72 UltraServer split into two x36 halves).
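Under the `<ClusterUUID>.<CliqueID>` format described above, the "multiple cliques per domain, label possibly absent" cases can be sketched as follows (node names and label values are invented):

```python
from collections import defaultdict

def split_clique(value):
    """Split an nvidia.com/gpu.clique value '<ClusterUUID>.<CliqueID>'.
    ClusterUUID identifies the physical NVL domain; CliqueID the
    Fabric-Manager-assigned logical sub-domain."""
    cluster_uuid, sep, clique_id = value.rpartition(".")
    if not sep or not cluster_uuid or not clique_id:
        return None  # malformed value
    return cluster_uuid, clique_id

# Hypothetical node -> label map: one UltraServer split into two halves,
# plus one node where Fabric Manager init has not completed (label absent).
nodes = {
    "node-a": "11111111-aaaa.32",
    "node-b": "11111111-aaaa.32",
    "node-c": "11111111-aaaa.33",  # same NVL domain, different clique
    "node-d": None,                # clique label absent
}

domains = defaultdict(set)
for node, label in nodes.items():
    if label is None:
        continue  # cannot infer the partition; fall back to coarser labels
    cluster_uuid, clique_id = split_clique(label)
    domains[cluster_uuid].add(clique_id)

print(dict(domains))  # one domain carrying two distinct cliques
```

This mirrors the empirical observation from the NVIDIA-internal cluster: distinct clique values sharing one domain-scoped identifier, with the clique label sometimes missing entirely.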

Follow-up (not in this PR)

The semantic mismatch between topograph's AWS provider (writes reservation-scoped CapacityBlockId) and the other MNNVL-aware providers (write Fabric-Manager clique IDs) will be tracked as a separate design issue. Concrete options there include: (a) AWS provider prefers `topology.k8s.aws/ultraserver-id` (when present) or `nvidia.com/gpu.clique` (when Fabric Manager has completed) over `CapacityBlockId`; (b) `accelerator` is split into two labels with different granularity; (c) left as-is and the doc's provider-dependent note suffices. Out of scope here; this PR is doc-only.

Checklist

  • Documentation Impact Evaluation — the changed section is `docs/reference/node-labels.md`; no other doc pages repeat these definitions authoritatively
  • `make qualify` — N/A (docs-only)
  • Every commit has a DCO sign-off

@resker resker requested a review from dmitsh as a code owner April 23, 2026 01:28
@greptile-apps (Contributor, bot) commented Apr 23, 2026

Greptile Summary

This docs-only PR refines the semantic descriptions of three label entries in docs/reference/node-labels.md to reflect provider-dependent granularity (accelerator carries an NVL Partition ID on MNNVL-aware providers but a reservation-scoped CapacityBlockId on AWS). It also adds Fabric Manager initialization preconditions and a scheduler label-choice guide for nvidia.com/gpu.clique, and updates the NVSentinel integration section to reflect the labels' inclusion in the default allowedLabels. The k8s.md change is a one-line cross-reference clean-up.

Confidence Score: 5/5

Safe to merge — docs-only change with no runtime impact and two minor P2 style nits.

All findings are P2 (inconsistent NVML constant name, ephemeral release phrasing). The semantic content is well-sourced, the provider-dependency clarifications are accurate per the PR description and cited AWS API references, and no code paths are touched.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/reference/node-labels.md | Significantly expanded with provider-dependent accelerator semantics, Fabric Manager init preconditions, a scheduler label-choice guide, and richer nvidia.com/gpu.clique and capacity-block-id descriptions; two minor style inconsistencies (NVML constant name, ephemeral release phrasing). |
| docs/engines/k8s.md | Single-line change replacing an inline GPU_FABRIC_STATE_COMPLETED explanation with a cross-reference link to node-labels.md; the link has no anchor, but all content is still reachable at the top of that page. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Scheduler needs topology label] --> B{Kubernetes cluster?}
    B -- No --> C[Read Slurm topology.conf directly]
    B -- Yes --> D{MNNVL hardware?}
    D -- No --> E[Use network.topology.nvidia.com/accelerator\nInfiniBand provider only source]
    D -- Yes --> F{Fabric Manager init\ncompleted?\nNVML_GPU_FABRIC_STATE_COMPLETED}
    F -- No --> G[nvidia.com/gpu.clique absent\nUse network.topology.nvidia.com/accelerator]
    F -- Yes --> H{Which provider?}
    H -- AWS --> I{Granularity needed?}
    H -- DRA / InfiniBand / Lambda AI --> J[Both labels carry same value\nnvidia.com/gpu.clique or accelerator]
    I -- NVL Domain --> K[Use network.topology.nvidia.com/accelerator\nCapacityBlockId = NVL Domain]
    I -- NVL Partition --> L[Use nvidia.com/gpu.clique\nCliqueID = finer sub-domain]
    J --> M[Need switch-level locality?]
    K --> M
    L --> M
    E --> M
    G --> M
    M -- Yes --> N[Add leaf / spine / core labels\nfrom InfiniBand or NetQ provider]
    M -- No --> O[Done]
    N --> O


@resker force-pushed the docs/aws-capacity-block-clarification branch from 7ebfd57 to 31f5d96 on April 23, 2026 at 17:54
@resker changed the title from "docs(reference): clarify AWS Capacity Block vs NVLink domain equivalence" to "docs(reference): clarify accelerator / Capacity Block / clique semantics" on Apr 23, 2026
@resker force-pushed the docs/aws-capacity-block-clarification branch 3 times, most recently from 8bd277d to c9c0d8d on April 24, 2026 at 01:18
Refines entries in docs/reference/node-labels.md and docs/engines/k8s.md
to reflect the distinct underlying concepts — and the semantic mismatch
that exists between topograph providers:

1. `network.topology.nvidia.com/accelerator`: renamed from "NVLink
   domain (clique) ID" to "Accelerated interconnect domain identifier"
   and notes that exact semantics are provider-dependent. MNNVL-aware
   providers (DRA, InfiniBand, Lambda AI) write an NVL Partition
   identifier (`<ClusterUUID>.<CliqueID>`, Fabric-Manager-derived).
   The AWS provider writes AWS's CapacityBlockId, a reservation-scoped
   value co-extensive with an UltraServer (i.e., the NVL Domain) on
   P6e-GB200.

2. `topology.k8s.aws/capacity-block-id`: replaces the prior
   "AWS Capacity Block (NVLink domain)" framing with citation-backed
   language. Per AWS's EC2 API reference for InstanceTopology, on
   UltraServer instances this "identifies instances within the
   UltraServer domain", making it NVL Domain-scoped. AWS's explicit
   NVL Domain label `topology.k8s.aws/ultraserver-id` is applied on
   SageMaker HyperPod-managed EKS; on plain EKS or self-managed
   Kubernetes, the NVL Domain must be derived from
   `nvidia.com/gpu.clique` (its ClusterUUID prefix encodes the domain).

3. `nvidia.com/gpu.clique`: clarifies that the value is an
   NVL Partition identifier formatted `<ClusterUUID>.<CliqueID>`,
   with ClusterUUID identifying the physical NVL Domain and CliqueID
   identifying a Fabric-Manager-assigned logical sub-domain.
   Documents the `NVML_GPU_FABRIC_STATE_COMPLETED` precondition and
   the "multiple cliques per NVL Domain" case (e.g., x72 UltraServer
   split into two x36 halves).

Also introduces NVIDIA Fabric Manager and NVML once in the relevant
paragraph (rather than using `GPU_FABRIC_STATE_COMPLETED` in passing
without explanation), and adds a "Choosing between `accelerator` and
`nvidia.com/gpu.clique` for scheduling" subsection with a tier-by-tier
breakdown of locality levels not covered by the clique label plus
caveats on refresh cadence (60s GFD interval) and
`FABRIC_MODE_RESTART` persistence.

The parallel mention in docs/engines/k8s.md is converted to a cross-
reference to avoid duplicating the Fabric-Manager explanation.

Finally, the "Integration with NVSentinel" section is updated to reflect
that NVSentinel/#1226 (merged 2026-04-23) added the four
`network.topology.nvidia.com/*` labels to the Metadata Augmentor's
default `allowedLabels`. Operators on earlier releases retain the
explicit-opt-in snippet; a cross-link to NVSentinel's
docs/INTEGRATIONS.md is added for the downstream-consumer perspective.

Uses "NVL Domain" / "NVL Partition" terminology throughout to
distinguish the physical rack-scale MNNVL from the logical sub-domain
Fabric Manager assigns within it.

Signed-off-by: Rob Esker <resker@nvidia.com>
@resker force-pushed the docs/aws-capacity-block-clarification branch from c9c0d8d to 1fa6736 on April 24, 2026 at 01:19
@dmitsh dmitsh merged commit fdb4428 into NVIDIA:main Apr 24, 2026
6 checks passed
dmitsh pushed a commit that referenced this pull request Apr 29, 2026
PR #268 removed the `branches: [main]` push filter from
`publish-fern-docs.yml` while keeping the `tags: [docs/v*]` filter. With
GitHub Actions, defining `tags:` without a corresponding `branches:`
restricts push events to tag refs only — branch pushes (including main)
no longer trigger the workflow even when their changed paths match the
`paths:` filter.
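Assuming a standard workflow layout, the restored trigger block would look roughly like this (the exact `paths:` patterns are a guess inferred from the commit message, not copied from the workflow file):

```yaml
on:
  push:
    branches: [main]       # restored: branch pushes trigger again
    tags: ['docs/v*']      # kept: tag pushes still publish
    paths: ['docs/**', 'fern/**']
  workflow_dispatch: {}    # manual runs remain available
```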

Symptom: the live Fern site at
https://topograph.docs.buildwithfern.com/topograph has not been
republished since the manual workflow_dispatch on 2026-04-20T16:51Z.
PRs #284, #289, #290, #291, and #292 all touched docs/ but produced
zero workflow runs. The Reference section restored by #284 (and the
clique-semantics clarifications added by #289) are on main but invisible
on the published site.

Restore `branches: [main]` so push events to main with docs/ or fern/
changes resume triggering publishes. Tag pushes for `docs/v*` and manual
`workflow_dispatch` continue to work unchanged.

To clear the backlog after this PR merges, dispatch the workflow
manually one time:

  gh workflow run publish-fern-docs.yml --repo NVIDIA/topograph --ref main

Signed-off-by: Rob Esker <resker@nvidia.com>