docs: small-batch reconciliation of blog draft content into repo docs #264
Greptile Summary
Four additive, independent documentation changes that fold content from an unpublished blog draft into the repo: a disaggregated-inference paragraph in …
Confidence Score: 5/5
Safe to merge: docs-only, no code changes, all four additions are technically accurate and well-placed.
All changes are additive documentation with no code modifications. Content is technically sound and consistent with the existing provider docs. No broken links introduced in the changed content, and no P0/P1 findings. No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Cluster topology needed] --> B{Fabric type?}
B -->|Cloud: AWS / GCP / OCI / Nebius| C[CSP provider]
B -->|Spectrum-X or MNNVL infra| D{NetQ deployed?}
B -->|InfiniBand, no NetQ| E{Deployment?}
D -->|Yes| F[NetQ provider\nobserved live topology]
D -->|No| E
E -->|Bare-metal / Slurm| G[infiniband-bm\nibnetdiscover via pdsh]
E -->|Kubernetes| H[infiniband-k8s\nibnetdiscover via pod exec]
F --> I{Also need K8s MNNVL scheduling?}
I -->|Yes| J[DRA provider\ncoexists with NetQ]
I -->|No| K[Engine: slurm / k8s / slinky]
G --> K
H --> K
C --> K
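The decision flow in the chart above can be read as a small selection function. The following is a minimal sketch in Go, not code from the repository: the `Cluster` type, its fields, the enum names, and the `csp`, `netq`, and `dra` strings are hypothetical stand-ins for the chart's boxes; only `infiniband-bm` and `infiniband-k8s` are names taken from the chart itself.

```go
package main

import "fmt"

// Fabric and Deployment are hypothetical enums that mirror the chart's
// branch labels; they are not types from the Topograph code base.
type Fabric int

const (
	FabricCloud            Fabric = iota // AWS / GCP / OCI / Nebius
	FabricSpectrumXOrMNNVL               // Spectrum-X or MNNVL infrastructure
	FabricInfiniBand                     // InfiniBand, no NetQ
)

type Deployment int

const (
	BareMetal  Deployment = iota // bare-metal / Slurm
	Kubernetes                   // Kubernetes
)

// Cluster gathers the answers to the questions the chart asks.
type Cluster struct {
	Fabric               Fabric
	Deployment           Deployment
	NetQDeployed         bool
	NeedsMNNVLScheduling bool
}

// selectProviders walks the chart top to bottom and returns the provider
// labels it arrives at. "csp" stands for whichever cloud provider applies.
func selectProviders(c Cluster) []string {
	if c.Fabric == FabricCloud {
		return []string{"csp"}
	}
	if c.Fabric == FabricSpectrumXOrMNNVL && c.NetQDeployed {
		if c.NeedsMNNVLScheduling {
			// The chart notes that the DRA provider coexists with NetQ.
			return []string{"netq", "dra"}
		}
		return []string{"netq"}
	}
	// Either InfiniBand without NetQ, or Spectrum-X/MNNVL without NetQ:
	// both branches fall through to the deployment question.
	if c.Deployment == Kubernetes {
		return []string{"infiniband-k8s"}
	}
	return []string{"infiniband-bm"}
}

func main() {
	c := Cluster{
		Fabric:               FabricSpectrumXOrMNNVL,
		NetQDeployed:         true,
		NeedsMNNVLScheduling: true,
	}
	fmt.Println(selectProviders(c)) // prints [netq dra]
}
```

The engine choice (slurm / k8s / slinky) is left out of the sketch because the chart treats it as a separate step that follows provider selection.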
Reviews (2): Last reviewed commit: "docs: reconcile small-batch gaps from bl..."
## Areas where contributions are especially welcome

- **Additional cloud and colocation providers.** If your infrastructure isn't covered by the existing AWS, GCP, OCI, Nebius, Lambda AI, and CoreWeave providers, adding support means integrating with your provider's topology API and implementing a new Topograph provider. The provider interface is documented in the repository's [`AGENTS.md`](./AGENTS.md) "Adding a New Provider" section.
Broken internal link to non-existent `AGENTS.md`

`./AGENTS.md` is referenced as the canonical guide for adding a new provider, but the file does not exist anywhere in the repository (confirmed via full tree search). Contributors following this link will hit a 404. Either this file needs to be created before (or in) this PR, or the reference should point to an existing doc that describes the provider interface.
Original:

- **Additional cloud and colocation providers.** If your infrastructure isn't covered by the existing AWS, GCP, OCI, Nebius, Lambda AI, and CoreWeave providers, adding support means integrating with your provider's topology API and implementing a new Topograph provider. The provider interface is documented in the repository's [`AGENTS.md`](./AGENTS.md) "Adding a New Provider" section.

Suggested:

- **Additional cloud and colocation providers.** If your infrastructure isn't covered by the existing AWS, GCP, OCI, Nebius, Lambda AI, and CoreWeave providers, adding support means integrating with your provider's topology API and implementing a new Topograph provider. The provider interface is documented in the repository's provider source under `pkg/providers/`.
Folds high-value context from the unpublished blog draft into the repository docs so the blog can reference canonical sources rather than being the sole host of this material.

- README.md: expand "Motivation and Problem Statement" with the disaggregated inference scenario: prefill/decode separation, KV cache transfer, NVLink vs. Ethernet performance cliff
- docs/providers/infiniband.md: add a "why automate IB discovery" paragraph with the scale argument (hand-maintained labels work at 32 nodes, break at 1,000)
- docs/providers/netq.md: add "Observed vs. Intended Topology" section describing NetQ's distinctive ability to observe degraded-but-up links via live telemetry
- CONTRIBUTING.md: add "Community" section with Kubernetes Slack channels (#topology-aware-scheduling, #gpu-nvidia) and a pointer to the pinned roadmap/focus-areas issue

The KEP-4962 note on docs/reference/node-labels.md is intentionally deferred to a follow-up PR once NVIDIA#254 lands, since node-labels.md does not yet exist on main.

Forward-looking contribution areas are captured as a pinned GitHub issue rather than inline in CONTRIBUTING.md, per the convention that CONTRIBUTING.md describes how to contribute while roadmap/direction lives in a pinned issue.

Signed-off-by: Rob Esker <resker@nvidia.com>
b920d94 to d5ecd14
Adds a short subsection under `## Labels` covering KEP-4962
("Standardizing the Representation of Cluster Network Topology"),
which is pre-GA and still under upstream review at
kubernetes/enhancements#4962 (draft PR #4965). Notes that the KEP's
framing allows vendor prefixes like `network.topology.nvidia.com/*`
to coexist with the standard `topology.kubernetes.io/` keys rather
than replace them, and that Topograph will evaluate aligning or
providing both if and when the KEP reaches GA. For now, the
`network.topology.nvidia.com/*` keys remain authoritative for
Topograph-deployed clusters.
This note was deferred from PR NVIDIA#264 because the target file
(`docs/reference/node-labels.md`) did not yet exist on `main`; it
was introduced by PR NVIDIA#254, which merged 2026-04-19.
Signed-off-by: Rob Esker <resker@nvidia.com>
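To make the coexistence point concrete, here is a minimal Go sketch that partitions a node's labels by the two prefixes named in the note. It is an illustration under stated assumptions, not Topograph code: the function name is hypothetical, and the sample label suffixes and values (`leaf`, `zone`, `leaf-01`, `zone-a`) are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// The two prefixes come from the commit message above.
const (
	vendorPrefix   = "network.topology.nvidia.com/"
	standardPrefix = "topology.kubernetes.io/"
)

// splitTopologyLabels (hypothetical helper) returns the sorted label keys
// under each prefix. Keys under neither prefix are ignored.
func splitTopologyLabels(labels map[string]string) (vendor, standard []string) {
	for key := range labels {
		switch {
		case strings.HasPrefix(key, vendorPrefix):
			vendor = append(vendor, key)
		case strings.HasPrefix(key, standardPrefix):
			standard = append(standard, key)
		}
	}
	sort.Strings(vendor)
	sort.Strings(standard)
	return vendor, standard
}

func main() {
	// Illustrative node labels: both key families sit side by side
	// on the same node without conflicting.
	labels := map[string]string{
		"network.topology.nvidia.com/leaf": "leaf-01",
		"topology.kubernetes.io/zone":      "zone-a",
		"kubernetes.io/hostname":           "node-1",
	}
	vendor, standard := splitTopologyLabels(labels)
	fmt.Println("vendor keys:", vendor)
	fmt.Println("standard keys:", standard)
}
```

Because the two prefixes share no common leading segment, a scheduler or tool can read either family independently of the other.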
* docs(reference): add KEP-4962 upstream-alignment note to node-labels.md

  Adds a short subsection under `## Labels` covering KEP-4962 ("Standardizing the Representation of Cluster Network Topology"), which is pre-GA and still under upstream review at kubernetes/enhancements#4962 (draft PR #4965). Notes that the KEP's framing allows vendor prefixes like `network.topology.nvidia.com/*` to coexist with the standard `topology.kubernetes.io/` keys rather than replace them, and that Topograph will evaluate aligning or providing both if and when the KEP reaches GA. For now, the `network.topology.nvidia.com/*` keys remain authoritative for Topograph-deployed clusters.

  This note was deferred from PR #264 because the target file (`docs/reference/node-labels.md`) did not yet exist on `main`; it was introduced by PR #254, which merged 2026-04-19.

  Signed-off-by: Rob Esker <resker@nvidia.com>

* docs(agents): restore two Doc-Impact Evaluation table rows

  Adds back two rows to the Documentation Impact Evaluation table in `AGENTS.md` and `.claude/CLAUDE.md` that were removed from PR #269 to avoid cross-referencing content not yet on `main`:

  - Chart template row pointing at `docs/engines/k8s.md` "Exposing the Topograph API" section (added to main by PR #259)
  - Label or annotation key row pointing at `docs/reference/node-labels.md` (added to main by PR #254)

  Both gating PRs have now merged, so the rows can be restored without broken cross-references. Paired edit preserves the byte-identical invariant between `AGENTS.md` and `.claude/CLAUDE.md` from line 6 onward (verified with `cmp`).

  Signed-off-by: Rob Esker <resker@nvidia.com>

---------

Signed-off-by: Rob Esker <resker@nvidia.com>
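The "byte-identical from line 6 onward" invariant mentioned above can also be checked programmatically. The commit says it was verified with `cmp` but does not give the exact invocation, so the following Go sketch is an equivalent check under that assumption, not the author's method. The helper names are hypothetical, and in-memory samples stand in for the two files (a real check would load them with `os.ReadFile`).

```go
package main

import (
	"bytes"
	"fmt"
)

// tailFromLine returns the content of data starting at 1-indexed line n.
// If data has fewer than n lines, it returns nil.
func tailFromLine(data []byte, n int) []byte {
	for line := 1; line < n; line++ {
		i := bytes.IndexByte(data, '\n')
		if i < 0 {
			return nil
		}
		data = data[i+1:]
	}
	return data
}

// identicalFromLine reports whether a and b are byte-identical from
// 1-indexed line n to the end of each input.
func identicalFromLine(a, b []byte, n int) bool {
	return bytes.Equal(tailFromLine(a, n), tailFromLine(b, n))
}

func main() {
	// Two sample documents whose first five lines differ and whose
	// remaining lines are the same.
	shared := "## Documentation Impact Evaluation\nrow one\nrow two\n"
	agents := []byte("a1\na2\na3\na4\na5\n" + shared)
	claude := []byte("c1\nc2\nc3\nc4\nc5\n" + shared)

	fmt.Println(identicalFromLine(agents, claude, 6)) // prints true
	fmt.Println(identicalFromLine(agents, claude, 5)) // prints false
}
```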
Summary
Folds high-value content from an unpublished blog draft into the repository docs so the blog can reference canonical sources rather than being the sole host of this material. All four changes are additive, independent, and low-risk.
Changes
- `README.md`: expand the "Motivation and Problem Statement" section (merged via docs(readme): add 'Motivation and Problem Statement' section #247) with the disaggregated inference scenario: prefill/decode separation, KV cache transfer across the worker boundary, and the NVLink vs. Ethernet performance cliff. Complements the existing distributed-training framing.
- `docs/providers/infiniband.md`: add a "Why automate IB discovery?" paragraph with the scale argument: hand-maintained labels work at ~32 nodes, break at 1,000 with fabric churn.
- `docs/providers/netq.md`: add an "Observed vs. Intended Topology" section describing NetQ's distinctive ability to observe degraded-but-up links via live telemetry. This signal is invisible to `ibnetdiscover`-based discovery and unreported by any cloud placement API.
- `CONTRIBUTING.md`: expand beyond the current 62 lines of DCO boilerplate with:
  - … (`#topology-aware-scheduling`, `#gpu-nvidia`)
  - … `ClusterTopology` integration

Deferred to a follow-up
One additional small-batch item from the same blog-gap analysis, a KEP-4962 "Upstream alignment" paragraph in `docs/reference/node-labels.md`, is intentionally deferred to a follow-up PR, since that file does not yet exist on `main` (it's introduced by #254).

Test plan
pkg/registry/registry.go