docs: small-batch reconciliation of blog draft content into repo docs #264

Merged
dmitsh merged 1 commit into NVIDIA:main from resker:docs/blog-gap-small-batch on Apr 17, 2026

@resker resker commented Apr 17, 2026

Summary

Folds high-value content from an unpublished blog draft into the repository docs so the blog can reference canonical sources rather than being the sole host of this material. All four changes are additive, independent, and low-risk.

Changes

  • README.md — expand the "Motivation and Problem Statement" section (merged via #247, "docs(readme): add 'Motivation and Problem Statement' section") with the disaggregated inference scenario: prefill/decode separation, KV cache transfer across the worker boundary, and the NVLink vs. Ethernet performance cliff. Complements the existing distributed-training framing.

  • docs/providers/infiniband.md — add a "Why automate IB discovery?" paragraph with the scale argument: hand-maintained labels work at ~32 nodes, break at 1,000 with fabric churn.

  • docs/providers/netq.md — add an "Observed vs. Intended Topology" section describing NetQ's distinctive ability to observe degraded-but-up links via live telemetry. This signal is invisible to ibnetdiscover-based discovery and unreported by any cloud placement API.

  • CONTRIBUTING.md — expand beyond the current 62 lines of DCO boilerplate with:

    • Community section — Kubernetes Slack channels (#topology-aware-scheduling, #gpu-nvidia)
    • Areas where contributions are especially welcome — new providers, on-premises network fabrics, Kubernetes Workload API / KEP-5732, KEP-4962 upstream label schema, Grove ClusterTopology integration

Deferred to a follow-up

One additional small-batch item from the same blog-gap analysis — a KEP-4962 "Upstream alignment" paragraph in docs/reference/node-labels.md — is intentionally deferred to a follow-up PR, since that file does not yet exist on main (it's introduced by #254).

Test plan

  • Markdown renders cleanly in GitHub preview for all four files
  • All external links verified (KEP-5732, KEP-4962, Grove, Kubernetes Workload API, Kubernetes Slack channel archives)
  • Provider list in CONTRIBUTING.md (AWS, GCP, OCI, Nebius, Lambda AI, CoreWeave) verified against pkg/registry/registry.go
  • No conflicts with other open PRs — this PR touches README, infiniband.md, and netq.md in sections that do not overlap with the edits in open PR #255 ("docs: document gpu.clique relationship and non-MNNVL topology source")

@resker resker requested a review from dmitsh as a code owner April 17, 2026 14:39

greptile-apps Bot commented Apr 17, 2026

Greptile Summary

Four additive, independent documentation changes that fold content from an unpublished blog draft into the repo: a disaggregated-inference paragraph in README.md, a scale-motivation paragraph in docs/providers/infiniband.md, an "Observed vs. Intended Topology" section in docs/providers/netq.md, and a Community section with Kubernetes Slack links in CONTRIBUTING.md. No code is changed.

Confidence Score: 5/5

Safe to merge — docs-only, no code changes, all four additions are technically accurate and well-placed.

All changes are additive documentation with no code modifications. Content is technically sound and consistent with the existing provider docs. No broken links introduced in the changed content, and no P0/P1 findings.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| README.md | Adds a disaggregated inference paragraph to the Motivation section, illustrating KV-cache transfer cost and NVLink vs. Ethernet performance cliff; technically accurate and well-integrated into the existing narrative. |
| CONTRIBUTING.md | Adds a Community section with Kubernetes Slack links and a reference to the pinned Roadmap issue; expansion is clean and doesn't introduce any broken references in the current content. |
| docs/providers/infiniband.md | Prepends a "Why automate IB discovery?" paragraph giving a concrete scale argument (~32 vs 1,000 nodes); fits naturally before the existing provider choice guidance. |
| docs/providers/netq.md | Inserts an "Observed vs. Intended Topology" section highlighting NetQ's live telemetry advantage over ibnetdiscover or static labels; placed correctly before the Output section. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Cluster topology needed] --> B{Fabric type?}
    B -->|Cloud: AWS / GCP / OCI / Nebius| C[CSP provider]
    B -->|Spectrum-X or MNNVL infra| D{NetQ deployed?}
    B -->|InfiniBand, no NetQ| E{Deployment?}

    D -->|Yes| F[NetQ provider\nobserved live topology]
    D -->|No| E

    E -->|Bare-metal / Slurm| G[infiniband-bm\nibnetdiscover via pdsh]
    E -->|Kubernetes| H[infiniband-k8s\nibnetdiscover via pod exec]

    F --> I{Also need K8s MNNVL scheduling?}
    I -->|Yes| J[DRA provider\ncoexists with NetQ]
    I -->|No| K[Engine: slurm / k8s / slinky]
    G --> K
    H --> K
    C --> K
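The branches of the flowchart above can also be read as a small decision function. The following is a minimal sketch, not Topograph code: the `Cluster` type, its field names, and the `"csp"`, `"netq"`, and `"dra"` strings are hypothetical labels chosen for illustration. Only `infiniband-bm` and `infiniband-k8s` are taken verbatim from the chart.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cluster:
    """Hypothetical cluster description; field names are illustrative only."""
    fabric: str                      # "cloud", "spectrum-x-or-mnnvl", or "infiniband"
    netq_deployed: bool = False
    deployment: str = "kubernetes"   # "kubernetes" or "bare-metal"
    needs_k8s_mnnvl: bool = False


def choose_providers(cluster: Cluster) -> List[str]:
    """Follow the flowchart's branches and return the provider(s) it lands on."""
    # Cloud fabrics go straight to the cloud service provider (CSP) branch.
    if cluster.fabric == "cloud":
        return ["csp"]

    # Spectrum-X / MNNVL with NetQ deployed: NetQ, optionally alongside DRA.
    if cluster.fabric == "spectrum-x-or-mnnvl" and cluster.netq_deployed:
        providers = ["netq"]
        if cluster.needs_k8s_mnnvl:
            providers.append("dra")
        return providers

    # Everything else falls through to the ibnetdiscover-based providers,
    # split by deployment type.
    if cluster.fabric in ("spectrum-x-or-mnnvl", "infiniband"):
        if cluster.deployment == "bare-metal":
            return ["infiniband-bm"]
        if cluster.deployment == "kubernetes":
            return ["infiniband-k8s"]
        raise ValueError(f"unknown deployment: {cluster.deployment!r}")

    raise ValueError(f"unknown fabric: {cluster.fabric!r}")
```

For example, `choose_providers(Cluster("spectrum-x-or-mnnvl", netq_deployed=True, needs_k8s_mnnvl=True))` follows the NetQ branch and then adds the DRA provider, matching the "coexists with NetQ" node in the chart.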

Reviews (2): Last reviewed commit: "docs: reconcile small-batch gaps from bl..."

Comment thread CONTRIBUTING.md Outdated

## Areas where contributions are especially welcome

- **Additional cloud and colocation providers.** If your infrastructure isn't covered by the existing AWS, GCP, OCI, Nebius, Lambda AI, and CoreWeave providers, adding support means integrating with your provider's topology API and implementing a new Topograph provider. The provider interface is documented in the repository's [`AGENTS.md`](./AGENTS.md) "Adding a New Provider" section.

P1 Broken internal link to non-existent AGENTS.md

./AGENTS.md is referenced as the canonical guide for adding a new provider, but the file does not exist anywhere in the repository (confirmed via full tree search). Contributors following this link will hit a 404. Either this file needs to be created before (or in) this PR, or the reference should point to an existing doc that describes the provider interface.

Suggested change

Current:
- **Additional cloud and colocation providers.** If your infrastructure isn't covered by the existing AWS, GCP, OCI, Nebius, Lambda AI, and CoreWeave providers, adding support means integrating with your provider's topology API and implementing a new Topograph provider. The provider interface is documented in the repository's [`AGENTS.md`](./AGENTS.md) "Adding a New Provider" section.

Suggested:
- **Additional cloud and colocation providers.** If your infrastructure isn't covered by the existing AWS, GCP, OCI, Nebius, Lambda AI, and CoreWeave providers, adding support means integrating with your provider's topology API and implementing a new Topograph provider. The provider interface is documented in the repository's provider source under `pkg/providers/`.
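
A broken relative link of this kind can be caught mechanically before review. The sketch below is one possible approach, not the check the reviewer ran: `broken_relative_links` is a hypothetical helper, it only handles inline `[text](target)` links, and the list of repository files is assumed to be supplied by the caller (for instance from a tree listing).

```python
import posixpath
import re
from typing import Iterable, List

# Matches inline Markdown links of the form [text](target).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme such as "https:" or "mailto:".
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def broken_relative_links(markdown: str, doc_path: str,
                          repo_files: Iterable[str]) -> List[str]:
    """Return repo-relative link targets in `markdown` that are not in `repo_files`."""
    known = set(repo_files)
    base = posixpath.dirname(doc_path)
    broken = []
    for target in LINK_RE.findall(markdown):
        # External URLs and in-page anchors are out of scope for this check.
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue
        # Drop any "#fragment" and resolve relative to the document's directory.
        path = target.split("#", 1)[0]
        resolved = posixpath.normpath(posixpath.join(base, path))
        if resolved not in known:
            broken.append(resolved)
    return broken
```

Called on the CONTRIBUTING.md line quoted above with a file list that lacks `AGENTS.md`, the function would report `AGENTS.md` as unresolved.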

Folds high-value context from the unpublished blog draft into the
repository docs so the blog can reference canonical sources rather
than being the sole host of this material.

- README.md: expand "Motivation and Problem Statement" with the
  disaggregated inference scenario — prefill/decode separation,
  KV cache transfer, NVLink vs. Ethernet performance cliff
- docs/providers/infiniband.md: add a "why automate IB discovery"
  paragraph with the scale argument (hand-maintained labels work at
  32 nodes, break at 1,000)
- docs/providers/netq.md: add "Observed vs. Intended Topology"
  section describing NetQ's distinctive ability to observe
  degraded-but-up links via live telemetry
- CONTRIBUTING.md: add "Community" section with Kubernetes Slack
  channels (#topology-aware-scheduling, #gpu-nvidia) and a pointer
  to the pinned roadmap/focus-areas issue

The KEP-4962 note on docs/reference/node-labels.md is intentionally
deferred to a follow-up PR once NVIDIA#254 lands, since node-labels.md
does not yet exist on main. Forward-looking contribution areas are
captured as a pinned GitHub issue rather than inline in
CONTRIBUTING.md, per the convention that CONTRIBUTING.md describes
how to contribute while roadmap/direction lives in a pinned issue.

Signed-off-by: Rob Esker <resker@nvidia.com>
@resker resker force-pushed the docs/blog-gap-small-batch branch from b920d94 to d5ecd14 April 17, 2026 14:45
@dmitsh dmitsh merged commit 52d2a8a into NVIDIA:main Apr 17, 2026
4 checks passed
resker added a commit to resker/topograph that referenced this pull request Apr 19, 2026
Adds a short subsection under `## Labels` covering KEP-4962
("Standardizing the Representation of Cluster Network Topology"),
which is pre-GA and still under upstream review at
kubernetes/enhancements#4962 (draft PR #4965). Notes that the KEP's
framing allows vendor prefixes like `network.topology.nvidia.com/*`
to coexist with the standard `topology.kubernetes.io/` keys rather
than replace them, and that Topograph will evaluate aligning or
providing both if and when the KEP reaches GA. For now, the
`network.topology.nvidia.com/*` keys remain authoritative for
Topograph-deployed clusters.

This note was deferred from PR NVIDIA#264 because the target file
(`docs/reference/node-labels.md`) did not yet exist on `main`; it
was introduced by PR NVIDIA#254, which merged 2026-04-19.

Signed-off-by: Rob Esker <resker@nvidia.com>
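
To illustrate the coexistence described in this commit message, here is a minimal sketch that groups a node's labels by prefix. The two prefixes come from the message itself; the specific label suffixes under `network.topology.nvidia.com/` and all label values are assumptions made up for the example, and `split_topology_labels` is a hypothetical helper rather than anything in Topograph.

```python
from typing import Dict

STANDARD_PREFIX = "topology.kubernetes.io/"
VENDOR_PREFIX = "network.topology.nvidia.com/"


def split_topology_labels(labels: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Group topology labels by prefix; labels with other prefixes are ignored."""
    groups: Dict[str, Dict[str, str]] = {"standard": {}, "vendor": {}}
    for key, value in labels.items():
        if key.startswith(STANDARD_PREFIX):
            groups["standard"][key] = value
        elif key.startswith(VENDOR_PREFIX):
            groups["vendor"][key] = value
    return groups


# Example node labels. The vendor suffixes ("leaf", "spine") and all values
# are hypothetical; only the prefixes are taken from the commit message.
node_labels = {
    "topology.kubernetes.io/region": "region-a",
    "topology.kubernetes.io/zone": "zone-a1",
    "network.topology.nvidia.com/leaf": "leaf-07",
    "network.topology.nvidia.com/spine": "spine-02",
    "kubernetes.io/hostname": "node-42",
}

groups = split_topology_labels(node_labels)
```

Because the two key families live under different prefixes, neither set overwrites the other on the same node, which is the property the note relies on.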
dmitsh pushed a commit that referenced this pull request Apr 19, 2026

* docs(reference): add KEP-4962 upstream-alignment note to node-labels.md

Adds a short subsection under `## Labels` covering KEP-4962
("Standardizing the Representation of Cluster Network Topology"),
which is pre-GA and still under upstream review at
kubernetes/enhancements#4962 (draft PR #4965). Notes that the KEP's
framing allows vendor prefixes like `network.topology.nvidia.com/*`
to coexist with the standard `topology.kubernetes.io/` keys rather
than replace them, and that Topograph will evaluate aligning or
providing both if and when the KEP reaches GA. For now, the
`network.topology.nvidia.com/*` keys remain authoritative for
Topograph-deployed clusters.

This note was deferred from PR #264 because the target file
(`docs/reference/node-labels.md`) did not yet exist on `main`; it
was introduced by PR #254, which merged 2026-04-19.

Signed-off-by: Rob Esker <resker@nvidia.com>

* docs(agents): restore two Doc-Impact Evaluation table rows

Adds back two rows to the Documentation Impact Evaluation table in
`AGENTS.md` and `.claude/CLAUDE.md` that were removed from PR #269
to avoid cross-referencing content not yet on `main`:

- Chart template row pointing at `docs/engines/k8s.md` "Exposing the
  Topograph API" section (added to main by PR #259)
- Label or annotation key row pointing at `docs/reference/node-labels.md`
  (added to main by PR #254)

Both gating PRs have now merged, so the rows can be restored without
broken cross-references. Paired edit preserves the byte-identical
invariant between `AGENTS.md` and `.claude/CLAUDE.md` from line 6
onward (verified with `cmp`).

Signed-off-by: Rob Esker <resker@nvidia.com>

---------

Signed-off-by: Rob Esker <resker@nvidia.com>
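
The second commit above states that `AGENTS.md` and `.claude/CLAUDE.md` are byte-identical from line 6 onward and that this was verified with `cmp`; the exact invocation is not given. As an alternative illustration, the following Python sketch performs an equivalent comparison. The function names are hypothetical, and line boundaries are assumed to follow `bytes.splitlines` (which treats `\n`, `\r\n`, and `\r` as line endings).

```python
def tail_from_line(data: bytes, start_line: int) -> bytes:
    """Return the bytes of `data` starting at 1-indexed line `start_line`."""
    lines = data.splitlines(keepends=True)  # keep line endings so bytes are preserved
    return b"".join(lines[start_line - 1:])


def identical_from_line(a: bytes, b: bytes, start_line: int = 6) -> bool:
    """True if `a` and `b` match byte for byte from `start_line` onward."""
    return tail_from_line(a, start_line) == tail_from_line(b, start_line)
```

In practice the inputs would be read with something like `pathlib.Path("AGENTS.md").read_bytes()` and `pathlib.Path(".claude/CLAUDE.md").read_bytes()`, so that no text decoding or newline translation interferes with the byte-level comparison.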