docs: small-batch reconciliation of blog draft content into repo docs by resker · Pull Request #264 · NVIDIA/topograph

resker · 2026-04-17T14:39:15Z

Summary

Folds high-value content from an unpublished blog draft into the repository docs so the blog can reference canonical sources rather than being the sole host of this material. All four changes are additive, independent, and low-risk.

Changes

- README.md: expand the "Motivation and Problem Statement" section (merged via docs(readme): add 'Motivation and Problem Statement' section #247) with the disaggregated inference scenario: prefill/decode separation, KV cache transfer across the worker boundary, and the NVLink vs. Ethernet performance cliff. Complements the existing distributed-training framing.
- docs/providers/infiniband.md: add a "Why automate IB discovery?" paragraph with the scale argument: hand-maintained labels work at ~32 nodes, break at 1,000 with fabric churn.
- docs/providers/netq.md: add an "Observed vs. Intended Topology" section describing NetQ's distinctive ability to observe degraded-but-up links via live telemetry. This signal is invisible to ibnetdiscover-based discovery and unreported by any cloud placement API.
- CONTRIBUTING.md: expand beyond the current 62 lines of DCO boilerplate with:
  - Community section: Kubernetes Slack channels (#topology-aware-scheduling, #gpu-nvidia)
  - Areas where contributions are especially welcome: new providers, on-premises network fabrics, Kubernetes Workload API / KEP-5732, KEP-4962 upstream label schema, Grove ClusterTopology integration

Deferred to a follow-up

One additional small-batch item from the same blog-gap analysis — a KEP-4962 "Upstream alignment" paragraph in docs/reference/node-labels.md — is intentionally deferred to a follow-up PR, since that file does not yet exist on main (it's introduced by #254).

Test plan

- Markdown renders cleanly in GitHub preview for all four files
- All external links verified (KEP-5732, KEP-4962, Grove, Kubernetes Workload API, Kubernetes Slack channel archives)
- Provider list in CONTRIBUTING.md (AWS, GCP, OCI, Nebius, Lambda AI, CoreWeave) verified against pkg/registry/registry.go
- No conflicts with other open PRs: this PR touches README.md, infiniband.md, and netq.md only in sections that do not overlap with the edits in open PR #255 ("docs: document gpu.clique relationship and non-MNNVL topology source")

greptile-apps · 2026-04-17T14:40:57Z

Greptile Summary

Four additive, independent documentation changes that fold content from an unpublished blog draft into the repo: a disaggregated-inference paragraph in README.md, a scale-motivation paragraph in docs/providers/infiniband.md, an "Observed vs. Intended Topology" section in docs/providers/netq.md, and a Community section with Kubernetes Slack links in CONTRIBUTING.md. No code is changed.

Confidence Score: 5/5

Safe to merge — docs-only, no code changes, all four additions are technically accurate and well-placed.

All changes are additive documentation with no code modifications. Content is technically sound and consistent with the existing provider docs. No broken links introduced in the changed content, and no P0/P1 findings.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| README.md | Adds a disaggregated inference paragraph to the Motivation section, illustrating KV-cache transfer cost and NVLink vs. Ethernet performance cliff; technically accurate and well-integrated into the existing narrative. |
| CONTRIBUTING.md | Adds a Community section with Kubernetes Slack links and a reference to the pinned Roadmap issue; expansion is clean and doesn't introduce any broken references in the current content. |
| docs/providers/infiniband.md | Prepends a "Why automate IB discovery?" paragraph giving a concrete scale argument (~32 vs 1,000 nodes); fits naturally before the existing provider choice guidance. |
| docs/providers/netq.md | Inserts an "Observed vs. Intended Topology" section highlighting NetQ's live telemetry advantage over ibnetdiscover or static labels; placed correctly before the Output section. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Cluster topology needed] --> B{Fabric type?}
    B -->|Cloud: AWS / GCP / OCI / Nebius| C[CSP provider]
    B -->|Spectrum-X or MNNVL infra| D{NetQ deployed?}
    B -->|InfiniBand, no NetQ| E{Deployment?}

    D -->|Yes| F[NetQ provider\nobserved live topology]
    D -->|No| E

    E -->|Bare-metal / Slurm| G[infiniband-bm\nibnetdiscover via pdsh]
    E -->|Kubernetes| H[infiniband-k8s\nibnetdiscover via pod exec]

    F --> I{Also need K8s MNNVL scheduling?}
    I -->|Yes| J[DRA provider\ncoexists with NetQ]
    I -->|No| K[Engine: slurm / k8s / slinky]
    G --> K
    H --> K
    C --> K
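
The decision path in the flowchart above can also be read as a small function. The following is only a sketch of that reading: the `Cluster` fields, the `choose_providers` name, and the `"csp"` placeholder are hypothetical and are not part of Topograph's API. The names `netq`, `dra`, `infiniband-bm`, and `infiniband-k8s` are taken from the chart's node labels.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical cluster description; field names are illustrative only."""
    fabric: str                      # "cloud", "spectrum-x", "mnnvl", or "infiniband"
    netq_deployed: bool = False
    kubernetes: bool = False         # False means bare-metal / Slurm
    needs_k8s_mnnvl_scheduling: bool = False


def choose_providers(c: Cluster) -> list[str]:
    """Follow the flowchart branches and return the provider name(s) it lands on."""
    if c.fabric not in ("cloud", "spectrum-x", "mnnvl", "infiniband"):
        raise ValueError(f"unknown fabric type: {c.fabric!r}")

    # Cloud fabrics go straight to the cloud (CSP) provider.
    if c.fabric == "cloud":
        return ["csp"]

    # Spectrum-X or MNNVL with NetQ deployed: observed live topology.
    if c.fabric in ("spectrum-x", "mnnvl") and c.netq_deployed:
        providers = ["netq"]
        if c.needs_k8s_mnnvl_scheduling:
            providers.append("dra")  # the chart shows DRA coexisting with NetQ
        return providers

    # No NetQ: fall back to ibnetdiscover-based discovery by deployment type.
    return ["infiniband-k8s" if c.kubernetes else "infiniband-bm"]
```

For example, `choose_providers(Cluster("infiniband", kubernetes=True))` follows the "InfiniBand, no NetQ" branch and then the "Kubernetes" branch.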

Reviews (2): Last reviewed commit: "docs: reconcile small-batch gaps from bl..."

greptile-apps · 2026-04-17T14:41:01Z

+
+## Areas where contributions are especially welcome
+
+- **Additional cloud and colocation providers.** If your infrastructure isn't covered by the existing AWS, GCP, OCI, Nebius, Lambda AI, and CoreWeave providers, adding support means integrating with your provider's topology API and implementing a new Topograph provider. The provider interface is documented in the repository's [`AGENTS.md`](./AGENTS.md) "Adding a New Provider" section.


Broken internal link to non-existent AGENTS.md

./AGENTS.md is referenced as the canonical guide for adding a new provider, but the file does not exist anywhere in the repository (confirmed via full tree search). Contributors following this link will hit a 404. Either this file needs to be created before (or in) this PR, or the reference should point to an existing doc that describes the provider interface.

Suggested change

- **Additional cloud and colocation providers.** If your infrastructure isn't covered by the existing AWS, GCP, OCI, Nebius, Lambda AI, and CoreWeave providers, adding support means integrating with your provider's topology API and implementing a new Topograph provider. The provider interface is documented in the repository's [`AGENTS.md`](./AGENTS.md) "Adding a New Provider" section.

- **Additional cloud and colocation providers.** If your infrastructure isn't covered by the existing AWS, GCP, OCI, Nebius, Lambda AI, and CoreWeave providers, adding support means integrating with your provider's topology API and implementing a new Topograph provider. The provider interface is documented in the repository's provider source under `pkg/providers/`.

Folds high-value context from the unpublished blog draft into the repository docs so the blog can reference canonical sources rather than being the sole host of this material.

- README.md: expand "Motivation and Problem Statement" with the disaggregated inference scenario (prefill/decode separation, KV cache transfer, NVLink vs. Ethernet performance cliff)
- docs/providers/infiniband.md: add a "why automate IB discovery" paragraph with the scale argument (hand-maintained labels work at 32 nodes, break at 1,000)
- docs/providers/netq.md: add "Observed vs. Intended Topology" section describing NetQ's distinctive ability to observe degraded-but-up links via live telemetry
- CONTRIBUTING.md: add "Community" section with Kubernetes Slack channels (#topology-aware-scheduling, #gpu-nvidia) and a pointer to the pinned roadmap/focus-areas issue

The KEP-4962 note on docs/reference/node-labels.md is intentionally deferred to a follow-up PR once NVIDIA#254 lands, since node-labels.md does not yet exist on main.

Forward-looking contribution areas are captured as a pinned GitHub issue rather than inline in CONTRIBUTING.md, per the convention that CONTRIBUTING.md describes how to contribute while roadmap/direction lives in a pinned issue.

Signed-off-by: Rob Esker <resker@nvidia.com>

* docs(reference): add KEP-4962 upstream-alignment note to node-labels.md

  Adds a short subsection under `## Labels` covering KEP-4962 ("Standardizing the Representation of Cluster Network Topology"), which is pre-GA and still under upstream review at kubernetes/enhancements#4962 (draft PR #4965). Notes that the KEP's framing allows vendor prefixes like `network.topology.nvidia.com/*` to coexist with the standard `topology.kubernetes.io/` keys rather than replace them, and that Topograph will evaluate aligning or providing both if and when the KEP reaches GA. For now, the `network.topology.nvidia.com/*` keys remain authoritative for Topograph-deployed clusters.

  This note was deferred from PR #264 because the target file (`docs/reference/node-labels.md`) did not yet exist on `main`; it was introduced by PR #254, which merged 2026-04-19.

  Signed-off-by: Rob Esker <resker@nvidia.com>

* docs(agents): restore two Doc-Impact Evaluation table rows

  Adds back two rows to the Documentation Impact Evaluation table in `AGENTS.md` and `.claude/CLAUDE.md` that were removed from PR #269 to avoid cross-referencing content not yet on `main`:

  - Chart template row pointing at `docs/engines/k8s.md` "Exposing the Topograph API" section (added to main by PR #259)
  - Label or annotation key row pointing at `docs/reference/node-labels.md` (added to main by PR #254)

  Both gating PRs have now merged, so the rows can be restored without broken cross-references. Paired edit preserves the byte-identical invariant between `AGENTS.md` and `.claude/CLAUDE.md` from line 6 onward (verified with `cmp`).

  Signed-off-by: Rob Esker <resker@nvidia.com>

Signed-off-by: Rob Esker <resker@nvidia.com>
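
The byte-identical check mentioned in the commit message can be approximated in a few lines. This is a minimal sketch, assuming the invariant means "both files contain the same bytes from line 6 to the end". The helper name `identical_from_line` is hypothetical, and this is not the `cmp` invocation the author ran, which the commit message does not show.

```python
from pathlib import Path


def identical_from_line(path_a: str, path_b: str, start_line: int = 6) -> bool:
    """Return True if both files hold identical bytes from start_line (1-indexed) onward."""

    def tail(path: str) -> bytes:
        # Keep line endings so the comparison stays byte-exact.
        lines = Path(path).read_bytes().splitlines(keepends=True)
        return b"".join(lines[start_line - 1:])

    return tail(path_a) == tail(path_b)
```

Called as `identical_from_line("AGENTS.md", ".claude/CLAUDE.md")`, it ignores the first five lines of each file and compares the rest.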

resker requested a review from dmitsh as a code owner April 17, 2026 14:39

greptile-apps Bot reviewed Apr 17, 2026


resker force-pushed the docs/blog-gap-small-batch branch from b920d94 to d5ecd14 Compare April 17, 2026 14:45

dmitsh approved these changes Apr 17, 2026


dmitsh merged commit 52d2a8a into NVIDIA:main Apr 17, 2026
4 checks passed

resker mentioned this pull request Apr 19, 2026

docs: post-#254 follow-ups — KEP-4962 note + Doc-Impact table rows #278

Merged

4 tasks

dmitsh merged 1 commit into NVIDIA:main from resker:docs/blog-gap-small-batch
