
docs(reference): clarify accelerator / Capacity Block / clique semantics #289

Merged
dmitsh merged 1 commit into NVIDIA:main from resker:docs/aws-capacity-block-clarification
Apr 24, 2026

Conversation

@resker (Collaborator) commented Apr 23, 2026

Description

Refines the semantic description of three entries in `docs/reference/node-labels.md` to reflect the distinct underlying concepts — and the semantic mismatch that exists between topograph providers — with citation-backed language.

Surfaced in NVIDIA/NVSentinel#1205 while discussing how NVSentinel's Metadata Augmentor should treat Topograph's labels vs existing AWS NFD labels. Additional empirical data from an NVIDIA-internal cluster showed that multiple distinct `nvidia.com/gpu.clique` values can share the same `topology.k8s.aws/capacity-block-id`, and cliques can be absent when Fabric Manager initialization has not completed — invalidating the prior doc's implicit 1:1 framing.

Changes

1. `network.topology.nvidia.com/accelerator` description

The label was previously described as "NVLink domain (clique) ID". That's accurate for providers that derive their accelerator value from Fabric Manager (DRA, InfiniBand, Lambda AI), but not for the AWS provider, which derives its accelerator value from the AWS Capacity Block reservation ID (`pkg/providers/aws/instance_topology.go#L110-L111`) — a reservation-scoped identifier, not an NVLink-partition identifier.

New description names the label "Accelerated interconnect domain identifier" and flags that exact semantics are provider-dependent, pointing at the provider matrix below the table.
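As an illustration of the provider matrix (provider names here follow the PR description, not any actual topograph API), a downstream consumer could encode "is this label partition-scoped?" like this:

```python
# Illustrative only: the provider set below is taken from the PR text,
# not from topograph's code.
PARTITION_SCOPED = {"dra", "infiniband", "lambda-ai"}  # Fabric-Manager-derived

def accelerator_is_nvl_partition(provider: str) -> bool:
    """True when network.topology.nvidia.com/accelerator carries an NVL
    Partition (clique) ID; False for AWS, where it carries the
    reservation-scoped CapacityBlockId."""
    return provider.lower() in PARTITION_SCOPED

print(accelerator_is_nvl_partition("aws"))  # False: reservation-scoped on AWS
```

The point of the sketch is that the same label key requires per-provider interpretation, which is exactly what the doc's provider matrix now makes explicit.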

2. `topology.k8s.aws/capacity-block-id` description

Replaces the prior "AWS Capacity Block (NVLink domain)" framing. Per the AWS EC2 API reference for `InstanceTopology`, on UltraServer instances the field "identifies instances within the UltraServer domain" — a reservation-scoped grouping. On P6e-GB200 it is co-extensive with one UltraServer (AWS requires reserving the UltraServer as a unit per the EKS UltraServer guide), but AWS's explicit "same NVLink domain" label is `topology.k8s.aws/ultraserver-id`, not `capacity-block-id`. The finer-grained NVLink partition — potentially multiple per UltraServer — is surfaced by the NVIDIA GPU Operator as `nvidia.com/gpu.clique`.
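For concreteness, the three labels on a single hypothetical P6e-GB200 node might look as follows. All values are invented for illustration; only the label keys come from the text above:

```yaml
# Hypothetical node labels (values invented):
topology.k8s.aws/capacity-block-id: cr-0123456789abcdef0  # reservation-scoped grouping
topology.k8s.aws/ultraserver-id: us-0fedcba9876543210     # AWS's explicit NVLink-domain label
nvidia.com/gpu.clique: 11111111-aaaa.32                   # finer NVL partition (GPU Operator)
```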

3. `nvidia.com/gpu.clique` description

Adds the Fabric-Manager-completion precondition (`NVML_GPU_FABRIC_STATE_COMPLETED`), notes that the label can be absent on MNNVL nodes where Fabric Manager initialization has not completed, and clarifies that a clique is a logical sub-domain of an MNNVL NVLink domain, so multiple clique values can appear within a single NVLink domain (e.g., an x72 UltraServer split into two x36 halves).
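Under the `<ClusterUUID>.<CliqueID>` format described above, the "multiple cliques per domain, label possibly absent" cases can be sketched as follows (node names and label values are invented):

```python
from collections import defaultdict

def split_clique(value):
    """Split an nvidia.com/gpu.clique value '<ClusterUUID>.<CliqueID>'.
    ClusterUUID identifies the physical NVL domain; CliqueID the
    Fabric-Manager-assigned logical sub-domain."""
    cluster_uuid, sep, clique_id = value.rpartition(".")
    if not sep or not cluster_uuid or not clique_id:
        return None  # malformed value
    return cluster_uuid, clique_id

# Hypothetical node -> label map: one UltraServer split into two halves,
# plus one node where Fabric Manager init has not completed (label absent).
nodes = {
    "node-a": "11111111-aaaa.32",
    "node-b": "11111111-aaaa.32",
    "node-c": "11111111-aaaa.33",  # same NVL domain, different clique
    "node-d": None,                # clique label absent
}

domains = defaultdict(set)
for node, label in nodes.items():
    if label is None:
        continue  # cannot infer the partition; fall back to coarser labels
    cluster_uuid, clique_id = split_clique(label)
    domains[cluster_uuid].add(clique_id)

print(dict(domains))  # one domain carrying two distinct cliques
```

This mirrors the empirical observation from the NVIDIA-internal cluster: distinct clique values sharing one domain-scoped identifier, with the clique label sometimes missing entirely.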

Follow-up (not in this PR)

The semantic mismatch between topograph's AWS provider (writes reservation-scoped CapacityBlockId) and the other MNNVL-aware providers (write Fabric-Manager clique IDs) will be tracked as a separate design issue. Concrete options there include: (a) AWS provider prefers `topology.k8s.aws/ultraserver-id` (when present) or `nvidia.com/gpu.clique` (when Fabric Manager has completed) over `CapacityBlockId`; (b) `accelerator` is split into two labels with different granularity; (c) left as-is and the doc's provider-dependent note suffices. Out of scope here; this PR is doc-only.

Checklist

  • Documentation Impact Evaluation — the changed section is `docs/reference/node-labels.md`; no other doc pages repeat these definitions authoritatively
  • `make qualify` — N/A (docs-only)
  • Every commit has a DCO sign-off

@resker resker requested a review from dmitsh as a code owner April 23, 2026 01:28
@greptile-apps (Contributor, bot) commented Apr 23, 2026

Greptile Summary

This docs-only PR refines the semantic descriptions of three label entries in docs/reference/node-labels.md to reflect provider-dependent granularity (accelerator carries an NVL Partition ID on MNNVL-aware providers but a reservation-scoped CapacityBlockId on AWS). It also adds Fabric Manager initialization preconditions and a scheduler label-choice guide for nvidia.com/gpu.clique, and updates the NVSentinel integration section to reflect the labels' inclusion in the default allowedLabels. The k8s.md change is a one-line cross-reference clean-up.

Confidence Score: 5/5

Safe to merge — docs-only change with no runtime impact and two minor P2 style nits.

All findings are P2 (inconsistent NVML constant name, ephemeral release phrasing). The semantic content is well-sourced, the provider-dependency clarifications are accurate per the PR description and cited AWS API references, and no code paths are touched.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/reference/node-labels.md | Significantly expanded with provider-dependent accelerator semantics, Fabric Manager init preconditions, a scheduler label-choice guide, and richer nvidia.com/gpu.clique and capacity-block-id descriptions; two minor style inconsistencies (NVML constant name, ephemeral release phrasing). |
| docs/engines/k8s.md | Single-line change replacing an inline GPU_FABRIC_STATE_COMPLETED explanation with a cross-reference link to node-labels.md; the link has no anchor, but all content is still reachable at the top of that page. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Scheduler needs topology label] --> B{Kubernetes cluster?}
    B -- No --> C[Read Slurm topology.conf directly]
    B -- Yes --> D{MNNVL hardware?}
    D -- No --> E[Use network.topology.nvidia.com/accelerator\nInfiniBand provider only source]
    D -- Yes --> F{Fabric Manager init\ncompleted?\nNVML_GPU_FABRIC_STATE_COMPLETED}
    F -- No --> G[nvidia.com/gpu.clique absent\nUse network.topology.nvidia.com/accelerator]
    F -- Yes --> H{Which provider?}
    H -- AWS --> I{Granularity needed?}
    H -- DRA / InfiniBand / Lambda AI --> J[Both labels carry same value\nnvidia.com/gpu.clique or accelerator]
    I -- NVL Domain --> K[Use network.topology.nvidia.com/accelerator\nCapacityBlockId = NVL Domain]
    I -- NVL Partition --> L[Use nvidia.com/gpu.clique\nCliqueID = finer sub-domain]
    J --> M[Need switch-level locality?]
    K --> M
    L --> M
    E --> M
    G --> M
    M -- Yes --> N[Add leaf / spine / core labels\nfrom InfiniBand or NetQ provider]
    M -- No --> O[Done]
    N --> O


@resker force-pushed the docs/aws-capacity-block-clarification branch from 7ebfd57 to 31f5d96 on April 23, 2026 at 17:54
@resker changed the title from "docs(reference): clarify AWS Capacity Block vs NVLink domain equivalence" to "docs(reference): clarify accelerator / Capacity Block / clique semantics" on Apr 23, 2026
@resker force-pushed the docs/aws-capacity-block-clarification branch 3 times, most recently from 8bd277d to c9c0d8d on April 24, 2026 at 01:18
Refines entries in docs/reference/node-labels.md and docs/engines/k8s.md
to reflect the distinct underlying concepts — and the semantic mismatch
that exists between topograph providers:

1. `network.topology.nvidia.com/accelerator`: renamed from "NVLink
   domain (clique) ID" to "Accelerated interconnect domain identifier"
   and notes that exact semantics are provider-dependent. MNNVL-aware
   providers (DRA, InfiniBand, Lambda AI) write an NVL Partition
   identifier (`<ClusterUUID>.<CliqueID>`, Fabric-Manager-derived).
   The AWS provider writes AWS's CapacityBlockId, a reservation-scoped
   value co-extensive with an UltraServer (i.e., the NVL Domain) on
   P6e-GB200.

2. `topology.k8s.aws/capacity-block-id`: replaces the prior
   "AWS Capacity Block (NVLink domain)" framing with citation-backed
   language. Per AWS's EC2 API reference for InstanceTopology, on
   UltraServer instances this "identifies instances within the
   UltraServer domain", making it NVL Domain-scoped. AWS's explicit
   NVL Domain label `topology.k8s.aws/ultraserver-id` is applied on
   SageMaker HyperPod-managed EKS; on plain EKS or self-managed
   Kubernetes, the NVL Domain must be derived from
   `nvidia.com/gpu.clique` (its ClusterUUID prefix encodes the domain).

3. `nvidia.com/gpu.clique`: clarifies that the value is an
   NVL Partition identifier formatted `<ClusterUUID>.<CliqueID>`,
   with ClusterUUID identifying the physical NVL Domain and CliqueID
   identifying a Fabric-Manager-assigned logical sub-domain.
   Documents the `NVML_GPU_FABRIC_STATE_COMPLETED` precondition and
   the "multiple cliques per NVL Domain" case (e.g., x72 UltraServer
   split into two x36 halves).

Also introduces NVIDIA Fabric Manager and NVML once in the relevant
paragraph (rather than using `GPU_FABRIC_STATE_COMPLETED` in passing
without explanation), and adds a "Choosing between `accelerator` and
`nvidia.com/gpu.clique` for scheduling" subsection with a tier-by-tier
breakdown of locality levels not covered by the clique label plus
caveats on refresh cadence (60s GFD interval) and
`FABRIC_MODE_RESTART` persistence.

The parallel mention in docs/engines/k8s.md is converted to a cross-
reference to avoid duplicating the Fabric-Manager explanation.

Finally, the "Integration with NVSentinel" section is updated to reflect
that NVSentinel/#1226 (merged 2026-04-23) added the four
`network.topology.nvidia.com/*` labels to the Metadata Augmentor's
default `allowedLabels`. Operators on earlier releases retain the
explicit-opt-in snippet; a cross-link to NVSentinel's
docs/INTEGRATIONS.md is added for the downstream-consumer perspective.

Uses "NVL Domain" / "NVL Partition" terminology throughout to
distinguish the physical rack-scale MNNVL from the logical sub-domain
Fabric Manager assigns within it.

Signed-off-by: Rob Esker <resker@nvidia.com>
@resker force-pushed the docs/aws-capacity-block-clarification branch from c9c0d8d to 1fa6736 on April 24, 2026 at 01:19
@dmitsh dmitsh merged commit fdb4428 into NVIDIA:main Apr 24, 2026
6 checks passed
dmitsh pushed a commit that referenced this pull request Apr 29, 2026
PR #268 removed the `branches: [main]` push filter from
`publish-fern-docs.yml` while keeping the `tags: [docs/v*]` filter. With
GitHub Actions, defining `tags:` without a corresponding `branches:`
restricts push events to tag refs only — branch pushes (including main)
no longer trigger the workflow even when their changed paths match the
`paths:` filter.
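Assuming a standard workflow layout, the restored trigger block would look roughly like this (the exact `paths:` patterns are a guess inferred from the commit message, not copied from the workflow file):

```yaml
on:
  push:
    branches: [main]       # restored: branch pushes trigger again
    tags: ['docs/v*']      # kept: tag pushes still publish
    paths: ['docs/**', 'fern/**']
  workflow_dispatch: {}    # manual runs remain available
```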

Symptom: the live Fern site at
https://topograph.docs.buildwithfern.com/topograph has not been
republished since the manual workflow_dispatch on 2026-04-20T16:51Z.
PRs #284, #289, #290, #291, and #292 all touched docs/ but produced
zero workflow runs. The Reference section restored by #284 (and the
clique-semantics clarifications added by #289) are on main but invisible
on the published site.

Restore `branches: [main]` so push events to main with docs/ or fern/
changes resume triggering publishes. Tag pushes for `docs/v*` and manual
`workflow_dispatch` continue to work unchanged.

To clear the backlog after this PR merges, dispatch the workflow
manually one time:

  gh workflow run publish-fern-docs.yml --repo NVIDIA/topograph --ref main

Signed-off-by: Rob Esker <resker@nvidia.com>