docs: document gpu.clique relationship and non-MNNVL topology source #255

Merged
dmitsh merged 3 commits into NVIDIA:main from resker:docs/readme-provider-updates
Apr 17, 2026

@resker
Collaborator

@resker resker commented Apr 17, 2026

Summary

Adds technical context clarifying how Topograph's network topology labels relate to nvidia.com/gpu.clique (set by the GPU Operator device plugin) and why Topograph is often the only source of topology on non-MNNVL GPU clusters (DGX B200/B300).

Changes

  • docs/engines/k8s.md — new subsections:
    • Relationship to nvidia.com/gpu.clique — explains that the InfiniBand provider derives its accelerator label value from the same ClusterUUID.CliqueId hardware identifiers that gpu.clique uses on MNNVL systems, so the two labels can be correlated there. On non-MNNVL systems gpu.clique is absent because the IMEX labeler requires GPU_FABRIC_STATE_COMPLETED, which non-MNNVL GPUs do not reach.
    • Mixed Workload Considerations — describes how topology-insensitive workloads (single-GPU inference, CPU services) fragment the available NVLink/leaf-switch groups, and notes that topology labels are a prerequisite for schedulers such as KAI Scheduler or Kueue TAS to mitigate this.
  • docs/providers/infiniband.md — notes that the accelerator value format matches gpu.clique in both infiniband-bm and infiniband-k8s variants.
  • docs/providers/netq.md — notes that NMX DomainUUID is a distinct identifier from gpu.clique's ClusterUUID.CliqueId; the values are not directly comparable.
  • README.md — adds a brief note that on non-MNNVL GPU clusters (DGX B200/B300 SuperPODs), gpu.clique is not set and Topograph with an InfiniBand provider is the only topology source.
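
The fragmentation concern in the Mixed Workload Considerations item can be illustrated with a minimal sketch. It assumes a hypothetical in-memory node list (plain dicts with "labels" and "busy" keys, not Kubernetes API objects) and made-up label values; only the accelerator label key is taken from this PR. It is not how KAI Scheduler or Kueue TAS work internally, only a way to see why the label is a prerequisite for reasoning about intact groups.

```python
from collections import defaultdict

# Label key as named in this PR's review summary; all node data is hypothetical.
ACCELERATOR_LABEL = "network.topology.nvidia.com/accelerator"


def free_nodes_by_domain(nodes):
    """Group nodes by accelerator label and count total and free nodes.

    `nodes` is a hypothetical list of dicts with "labels" (dict) and
    "busy" (bool) keys.
    """
    summary = defaultdict(lambda: {"total": 0, "free": 0})
    for node in nodes:
        domain = node["labels"].get(ACCELERATOR_LABEL)
        if domain is None:
            continue  # no topology label, so the node cannot be grouped
        summary[domain]["total"] += 1
        if not node["busy"]:
            summary[domain]["free"] += 1
    return dict(summary)


def intact_domains(nodes):
    """Return the domains in which every labeled node is still free."""
    return sorted(
        domain
        for domain, counts in free_nodes_by_domain(nodes).items()
        if counts["free"] == counts["total"]
    )
```

A single topology-insensitive pod landing on one node of a domain removes that domain from `intact_domains`, which is the fragmentation effect the new subsection describes. Without the label, the grouping step has nothing to key on.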

Why this matters

For operators evaluating Topograph: it clarifies when Topograph is essential and when it complements existing signals. For developers: it documents a non-obvious invariant, namely where identifier formats match across providers and where they diverge.
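
As a rough illustration of the "essential vs. complementary" distinction, the sketch below classifies a node from its label set. The two label keys are the ones named in this PR; the function names, return strings, and the idea of classifying from a plain dict are assumptions made for the example, not part of Topograph.

```python
# Label keys as named in this PR; everything else here is hypothetical.
GPU_CLIQUE_LABEL = "nvidia.com/gpu.clique"
ACCELERATOR_LABEL = "network.topology.nvidia.com/accelerator"


def topology_source(labels):
    """Classify a node by which topology labels it carries."""
    has_clique = GPU_CLIQUE_LABEL in labels
    has_accelerator = ACCELERATOR_LABEL in labels
    if has_clique and has_accelerator:
        return "both"  # Topograph complements gpu.clique
    if has_accelerator:
        return "topograph-only"  # e.g. the non-MNNVL case described above
    if has_clique:
        return "gpu.clique-only"
    return "none"


def labels_correlate(labels):
    """True when both labels are present and carry the same string value.

    Per the PR, this is expected with the InfiniBand provider on MNNVL
    systems, and not with the NetQ provider, whose value is a DomainUUID.
    """
    clique = labels.get(GPU_CLIQUE_LABEL)
    accelerator = labels.get(ACCELERATOR_LABEL)
    return clique is not None and clique == accelerator
```

A node that reports "topograph-only" is the situation where Topograph is the sole topology signal; "both" with `labels_correlate` returning True is the complementary case.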

Test plan

  • gpu.clique value format verified against NVIDIA/k8s-device-plugin/internal/lm/imex.go (IMEX labeler)
  • IsFabricAttached() behavior verified against NVIDIA/go-nvlib/pkg/nvlib/device/device.go
  • IB provider value format verified against pkg/providers/infiniband/common.go and bm.go (Cluster.ID() returns UUID + "." + cliqueID)
  • NetQ provider value format verified against pkg/providers/netq/nmx.go (DomainUUID source)
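
The value format checked in the third item (UUID + "." + cliqueID) can be sketched as follows. This is a Python illustration of the string format only, not the Go code in the repository; the function names are hypothetical, and splitting on the last dot is an assumption that holds as long as the clique ID itself contains no dot.

```python
def cluster_id(cluster_uuid, clique_id):
    """Join the two identifiers with a dot, as described in the test plan."""
    return f"{cluster_uuid}.{clique_id}"


def parse_cluster_id(value):
    """Split a ClusterUUID.CliqueId string into its two parts.

    Splits on the last dot (assumption: the clique ID contains no dot).
    Raises ValueError if either part is missing.
    """
    cluster_uuid, sep, clique_id = value.rpartition(".")
    if not sep or not cluster_uuid or not clique_id:
        raise ValueError(f"not in ClusterUUID.CliqueId form: {value!r}")
    return cluster_uuid, clique_id
```

A NetQ-sourced DomainUUID would typically fail `parse_cluster_id` (no dot separator) or parse into parts that do not match any gpu.clique value, which is consistent with the note that the two identifiers are not directly comparable.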

resker added 3 commits April 17, 2026 01:06
Adds a 'How Topograph Fits in the Kubernetes Topology Stack' section
after the Workflow overview, clarifying that Topograph (inter-node
network topology discovery) and the kubelet Topology Manager (intra-node
NUMA alignment) operate at different scopes and are complementary.
Includes a Mermaid architecture diagram.

Signed-off-by: Rob Esker <resker@nvidia.com>

The kubelet Topology Manager is a Kubernetes-specific concern and is only
relevant to readers of the k8s engine documentation. Moving the comparison
table and Mermaid diagram from README to docs/engines/k8s.md, immediately
after the label overview, where the context is clearest.

Signed-off-by: Rob Esker <resker@nvidia.com>

…/provider docs

- k8s.md: add section on relationship to nvidia.com/gpu.clique (IB
  provider produces same ClusterUUID.CliqueId value; gpu.clique absent
  on non-MNNVL systems making Topograph the only topology source)
- k8s.md: add Mixed Workload Considerations section explaining
  topology fragmentation when topology-insensitive workloads consume
  nodes alongside distributed training
- infiniband.md: note accelerator value format (ClusterUUID.CliqueId)
  matches nvidia.com/gpu.clique in both bm and k8s variants
- netq.md: note NMX DomainUUID differs in format from gpu.clique
- README.md: add sentence on non-MNNVL systems where gpu.clique is
  absent and Topograph is the only topology source

Signed-off-by: Rob Esker <resker@nvidia.com>
@resker resker requested a review from dmitsh as a code owner April 17, 2026 06:11
@codecov

codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.46%. Comparing base (1875ab8) to head (936e85b).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #255   +/-   ##
=======================================
  Coverage   68.46%   68.46%           
=======================================
  Files          82       82           
  Lines        4842     4842           
=======================================
  Hits         3315     3315           
  Misses       1395     1395           
  Partials      132      132           

@greptile-apps
Contributor

greptile-apps Bot commented Apr 17, 2026

Greptile Summary

This PR adds documentation clarifying the relationship between Topograph's network.topology.nvidia.com/accelerator label and the GPU Operator's nvidia.com/gpu.clique label, and why Topograph is the sole topology source on non-MNNVL GPU clusters (DGX B200/B300 SuperPODs). The technical claims were spot-checked against the source: bm.go and k8s.go both produce ClusterUUID.CliqueId via the same dot-joined path, the topology.KeyNodeClusterID annotation key matches the docs, and the NetQ DomainUUID sourced in nmx.go is demonstrably distinct from ClusterUUID.CliqueId.

Confidence Score: 5/5

Safe to merge — documentation-only PR with all technical claims verified against the source code.

All four changed files are documentation. Key technical claims (IB provider producing ClusterUUID.CliqueId, NetQ using DomainUUID, topology.KeyNodeClusterID annotation key) were verified directly against bm.go, k8s.go, nmx.go, and topology.go. The gpu.clique value format (ClusterUUID.CliqueId with dot separator) was confirmed via external NVIDIA documentation. No code paths are affected. No P0 or P1 findings.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/engines/k8s.md | Adds two new subsections: relationship between accelerator label and gpu.clique (technically accurate per bm.go/k8s.go), and mixed workload considerations (correct, dense paragraph). |
| docs/providers/infiniband.md | Appends format note to both infiniband-bm and infiniband-k8s How It Works steps; format claim verified against bm.go Cluster.ID() and k8s.go parseClusterID(). |
| docs/providers/netq.md | Appends distinction note about DomainUUID vs ClusterUUID.CliqueId; DomainUUID source verified in nmx.go; claim about both identifying the same physical domain is a reasonable engineering assertion. |
| README.md | Adds one sentence noting that non-MNNVL clusters (DGX B200/B300) don't get gpu.clique and rely solely on Topograph's IB provider; factually accurate. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    HW["GPU Hardware\n(nvidia-smi: ClusterUUID + CliqueId)"]

    subgraph MNNVL["MNNVL Systems (GB200 NVL72)"]
        DRA["DRA Provider\nreads gpu.clique"]
        IBMNNVL["InfiniBand Provider\n(if used)"]
        GPU_CLIQUE["nvidia.com/gpu.clique\n= ClusterUUID.CliqueId"]
        ACC_MNNVL["accelerator label\n= ClusterUUID.CliqueId"]
    end

    subgraph NonMNNVL["Non-MNNVL Systems (DGX B200/B300)"]
        IB["InfiniBand Provider\n(only topology source)"]
        NO_GPU_CLIQUE["nvidia.com/gpu.clique\nnot set"]
        ACC_NON["accelerator label\n= ClusterUUID.CliqueId"]
    end

    subgraph NetQFlow["NetQ / MNNVL via NMX"]
        NMX["NetQ NMX API\nDomainUUID"]
        ACC_NETQ["accelerator label\n= DomainUUID\n(different format)"]
    end

    HW -->|"GPU Operator reads"| GPU_CLIQUE
    HW -->|"IB provider reads"| IBMNNVL
    HW -->|"IB provider reads"| IB
    DRA --> GPU_CLIQUE
    IBMNNVL --> ACC_MNNVL
    GPU_CLIQUE -.->|"same value, correlatable"| ACC_MNNVL
    IB --> ACC_NON
    IB --> NO_GPU_CLIQUE
    NMX --> ACC_NETQ
    ACC_NETQ -.->|"same physical domain,\ndifferent string value"| ACC_NON

Reviews (1): Last reviewed commit: "docs: add gpu.clique relationship and no..."

@dmitsh dmitsh merged commit 0d1cefa into NVIDIA:main Apr 17, 2026
6 checks passed

2 participants