
docs(get-started): add install-path quickstarts for Kubernetes and Slurm #292

Merged
dmitsh merged 1 commit into NVIDIA:main from resker:docs/get-started-quickstarts
Apr 24, 2026

Conversation

@resker (Collaborator) commented Apr 23, 2026

Description

Fills the Getting Started gap that a new reader hits on https://topograph.docs.buildwithfern.com/topograph: the "Getting Started" section in the Fern nav currently contains only Overview, and the root README has no Quick Start. To actually install Topograph, a reader has to navigate to charts/topograph/README.md (chart-adjacent, not surfaced from the docs tree) or to docs/engines/slurm.md (buried under Engines, not Getting Started).

This PR adds two install-focused quickstarts under a new docs/get-started/ directory:

  • Install on Kubernetes — covers both the k8s engine (native K8s scheduling via node labels) and the slinky engine (Slurm-on-Kubernetes via ConfigMap) in one page with anchored subsections, since the two engines share the same Helm chart, the same prerequisites, and the same helm test verification. Engine-specific flags and verification are in anchored subsections so the Fern TOC exposes them.
  • Install on Slurm — bare-metal Slurm install via make deb / make rpm + systemd. Separate file because the install mechanism, prerequisites, config surface, and verification are different paradigms from Helm-based Kubernetes installs — merging would force every section into `if helm ... else rpm ...` branches.

Each page is scoped narrowly: prerequisites → install → verify → where to go next. The Getting Started experience is the on-ramp; the existing engine references (`docs/engines/k8s.md`, `docs/engines/slurm.md`, `docs/engines/slinky.md`), chart README, provider docs, and API reference remain the authoritative deeper references and are linked from the "Where to go next" section on each page. Tutorial-depth content (demo workloads, KAI Scheduler gang-scheduling examples, `podAffinity` walkthroughs) is deliberately out of scope for these pages; that belongs in the Tech Blog tutorial and in future `docs/integrations/` pages.

Also adds:

  • A Quick Start section to the root `README.md` that picks between the two paths (Kubernetes vs Slurm), so a reader landing via GitHub sees install guidance before the "Learn more" links.
  • Two new entries under the Fern nav's existing "Getting Started" section (`docs/index.yml`): "Install on Kubernetes" and "Install on Slurm".

Design notes

  • K8s + Slinky in one page: the two engines share the Helm chart, prerequisites, install command base (varies only in `--set global.engine.name=...` and a few engine params), and `helm test` verification. A single page with section anchors avoids ~60% content duplication.
  • Slurm separate: shares only the "pick a provider" concept and the `/healthz` endpoint with the K8s path. The install surface is fundamentally different.
  • Kubernetes 1.27+: the quickstart-k8s.md Prerequisites section states 1.27+. The floor is enforced at install time by #291 (`chore(chart): declare kubeVersion ">=1.27.0-0" in chart + subcharts`). Once #291 merges, the claim is machine-checked; until then it's documentary.
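
As a rough sketch of the shared install surface described in the design notes, the helper below assembles the `helm install` command for either engine and prints it rather than running it. The release name and namespace (`topograph`) match the `helm test` command used in the quickstart; the chart path `charts/topograph` assumes a local checkout of the repository. The function name and the use of `--create-namespace` are illustrative assumptions, and slinky-specific parameters are left out because this PR does not list them.

```shell
# Sketch only: print a helm install command for the chosen engine.
# Assumptions: release "topograph", namespace "topograph", chart at
# charts/topograph in a local checkout. Slinky-specific --set flags are
# omitted; the chart README is the reference for the real parameter names.
build_install_cmd() {
  engine="$1"
  case "$engine" in
    k8s|slinky) ;;
    *) echo "unsupported engine: $engine" >&2; return 1 ;;
  esac
  printf 'helm install topograph charts/topograph --namespace topograph --create-namespace --set global.engine.name=%s\n' "$engine"
}

# Review the printed command, then run it by hand.
build_install_cmd k8s
```

Printing instead of executing keeps the sketch safe to try on a machine without cluster access; only the engine name changes between the two paths.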

Checklist

  • Documentation Impact Evaluation — new docs directory added; Fern nav + root README updated in the same PR to reflect it
  • `make qualify` — N/A (docs-only, no Go changes)
  • Every commit has a DCO sign-off

Adds a Getting Started path that gets an operator from "I heard about
Topograph" to "Topograph is installed and healthy" without requiring
them to navigate into `charts/topograph/README.md` or
`docs/engines/slurm.md` directly.

Two pages under `docs/get-started/`:

- `quickstart-k8s.md` — covers both the `k8s` and `slinky` engines
  (same Helm chart, same prerequisites, same `helm test` verification;
  engine-specific flags and verification in anchored subsections). The
  k8s+slinky merge avoids duplicating the shared install surface.

- `quickstart-slurm.md` — bare-metal Slurm install via `make deb` /
  `make rpm` + systemd. Separate file because the install mechanism,
  prerequisites, config surface, and verification are different
  paradigms from Helm-based K8s.

Both pages are install-focused: prerequisites, install command,
verification, and "Where to go next" pointers into the existing engine
references, chart README, provider docs, and API reference. They
deliberately do not duplicate tutorial-depth content (demo workloads,
KAI gang-scheduling examples) — those belong in the Tech Blog tutorial
and future integration docs.

Also adds a root README "Quick Start" section that picks between the
two paths, and two new entries under the Fern nav's existing
"Getting Started" section.

Signed-off-by: Rob Esker <resker@nvidia.com>

@resker requested a review from dmitsh as a code owner on April 23, 2026 03:34

@greptile-apps (Bot, Contributor) commented Apr 23, 2026

Greptile Summary

This PR adds two install-focused quickstart pages (docs/get-started/quickstart-k8s.md and docs/get-started/quickstart-slurm.md), wires them into the Fern nav (docs/index.yml), and adds a Quick Start section to the root README.md to surface install guidance earlier in the reader journey. The content is well-scoped (prerequisites → install → verify → next steps) and correctly defers deep configuration detail to existing engine references and the chart README.

Confidence Score: 5/5

Safe to merge — docs-only PR with two minor P2 wording suggestions in the K8s quickstart.

All findings are P2: a slight conflation in the helm test description and a version-prerequisite note acknowledged as contingent on companion PR #291. Neither blocks users from successfully installing. Content is accurate, well-structured, and consistent with existing references.

docs/get-started/quickstart-k8s.md — minor wording improvements suggested around the helm test description and the Kubernetes 1.27 prerequisite.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/get-started/quickstart-k8s.md | New Kubernetes quickstart covering k8s and slinky engines; minor wording inaccuracy in helm test description and a Kubernetes version prerequisite that is contingent on a companion PR merging. |
| docs/get-started/quickstart-slurm.md | New bare-metal Slurm quickstart; content is accurate and well-structured with no issues found. |
| README.md | Adds a Quick Start section linking the two new paths; concise and consistent with the new quickstart pages. |
| docs/index.yml | Adds Install on Kubernetes and Install on Slurm entries to the Getting Started Fern nav section; paths are correct. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([Reader lands on docs / README]) --> B{Scheduler type?}
    B -->|Kubernetes| C[quickstart-k8s.md]
    B -->|Bare-metal Slurm| D[quickstart-slurm.md]

    C --> E{Engine?}
    E -->|k8s| F["helm install --set global.engine.name=k8s"]
    E -->|slinky| G["helm install --set global.engine.name=slinky + ConfigMap params"]

    F --> H["helm test topograph --namespace topograph"]
    G --> H
    H --> I["kubectl get nodes --show-labels (k8s)\nor kubectl get configmap (slinky)"]

    D --> J["make deb / make rpm"]
    J --> K["Edit /etc/topograph/topograph-config.yaml"]
    K --> L["systemctl enable --now topograph.service"]
    L --> M["curl http://localhost:49021/healthz"]
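
The last step of the Slurm branch in the flowchart can be sketched as a small polling helper, since the service may need a moment after `systemctl enable --now` before it answers. The URL is the one shown in the flowchart; the function name, attempt count, and delay are illustrative assumptions, not part of the quickstart.

```shell
# Sketch only: poll the health endpoint until it answers or attempts run out.
# The default URL comes from the flowchart; the function name, attempt count,
# and delay are hypothetical defaults chosen for illustration.
wait_for_healthz() {
  url="${1:-http://localhost:49021/healthz}"
  attempts="${2:-10}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP error statuses as failure, -s: silent, -S: still print errors
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not healthy after $attempts attempt(s)" >&2
  return 1
}

# Typical use after starting the unit:
#   sudo systemctl enable --now topograph.service && wait_for_healthz
```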

Reviews (1): Last reviewed commit: "docs(get-started): add install-path quic..."

```
helm test topograph --namespace topograph
```

The bundled test hooks probe `/healthz` and `/metrics` inside the cluster and expect HTTP 200 plus the `topograph_version` metric in the response body. A green result confirms the API server is running.

P2 helm test description conflates two separate probes

The chart README (and chart source) describes two distinct test hook pods: test-healthz probes /healthz and expects HTTP 200, while test-metrics probes /metrics and expects the topograph_version metric. The current phrasing — "probe /healthz and /metrics… and expect HTTP 200 plus the topograph_version metric in the response body" — implies both probes require topograph_version, which could lead a user to incorrectly suspect a failure if /healthz doesn't emit Prometheus text.

Suggested change
Original: The bundled test hooks probe `/healthz` and `/metrics` inside the cluster and expect HTTP 200 plus the `topograph_version` metric in the response body. A green result confirms the API server is running.
Suggested: The bundled test hooks probe `/healthz` (expects HTTP 200) and `/metrics` (expects HTTP 200 plus the `topograph_version` metric in the response body) inside the cluster. A green result confirms the API server is running.
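
To make the distinction between the two probes concrete, here is a minimal sketch that mirrors them by hand. It assumes the API is reachable locally (for example through a port-forward) and that `topograph_version` appears as a regular sample line in the Prometheus text output; the function name, host, and port are assumptions for illustration, not the chart's actual test hook code.

```shell
# Sketch only: reproduce the two test-hook checks manually.
# check_metrics_body reads a /metrics response on stdin and succeeds only if
# a topograph_version sample line is present. "# HELP" and "# TYPE" comment
# lines do not count, because they start with "#".
check_metrics_body() {
  grep -q '^topograph_version'
}

# Example against a locally reachable instance (host and port are assumptions):
#   curl -fsS -o /dev/null http://localhost:49021/healthz && echo "healthz ok"
#   curl -fsS http://localhost:49021/metrics | check_metrics_body && echo "metrics ok"
```

Only the `/metrics` check inspects the body; the `/healthz` check passes on status alone, which is the point the suggested wording makes.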


## Prerequisites

- **Kubernetes**: 1.27 or later

P2 Kubernetes version floor contradicts existing chart README

This line declares "1.27 or later" as a hard prerequisite, but charts/topograph/README.md currently states: "no hard floor is declared by this chart; the rendered manifests use only apps/v1, rbac.authorization.k8s.io/v1, and v1, all stable since Kubernetes 1.9." The PR description acknowledges the 1.27 floor is only enforced once companion PR #291 (kubeVersion: ">=1.27.0-0") merges. Until then a reader who consults both documents gets conflicting information. Consider adding a note pointing to the chart README, or delaying this line until #291 has landed.
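
Until the chart enforces the floor itself, a reader could check a cluster against 1.27 by hand. The sketch below is one way to do that comparison in plain shell; the function name is hypothetical, and the version string is assumed to look like the `gitVersion` values Kubernetes reports (for example `v1.28.3`).

```shell
# Sketch only: compare a Kubernetes version string against the 1.27 floor.
# Accepts strings such as "v1.28.3", "1.27", or "v1.30.1+k3s1".
meets_k8s_floor() {
  v="${1#v}"                 # drop a leading "v"
  major="${v%%.*}"           # text before the first dot
  rest="${v#*.}"             # text after the first dot
  minor="${rest%%[!0-9]*}"   # leading digits of the minor component
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "${minor:-0}" -ge 27 ]; }
}

# Example, assuming kubectl and jq are installed:
#   meets_k8s_floor "$(kubectl version -o json | jq -r .serverVersion.gitVersion)" \
#     && echo "cluster meets the 1.27 floor"
```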

@dmitsh merged commit 26ed9b2 into NVIDIA:main on Apr 24, 2026
6 checks passed

dmitsh pushed a commit that referenced this pull request on Apr 29, 2026:

PR #268 removed the `branches: [main]` push filter from
`publish-fern-docs.yml` while keeping the `tags: [docs/v*]` filter. With
GitHub Actions, defining `tags:` without a corresponding `branches:`
restricts push events to tag refs only — branch pushes (including main)
no longer trigger the workflow even when their changed paths match the
`paths:` filter.

Symptom: the live Fern site at
https://topograph.docs.buildwithfern.com/topograph has not been
republished since the manual workflow_dispatch on 2026-04-20T16:51Z.
PRs #284, #289, #290, #291, and #292 all touched docs/ but produced
zero workflow runs. The Reference section restored by #284 (and the
clique-semantics clarifications added by #289) are on main but invisible
on the published site.

Restore `branches: [main]` so push events to main with docs/ or fern/
changes resume triggering publishes. Tag pushes for `docs/v*` and manual
`workflow_dispatch` continue to work unchanged.

To clear the backlog after this PR merges, dispatch the workflow
manually one time:

  gh workflow run publish-fern-docs.yml --repo NVIDIA/topograph --ref main

Signed-off-by: Rob Esker <resker@nvidia.com>
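
One way to confirm the symptom, and later that the fix took effect, is to list push-triggered runs of the publish workflow on `main`. The wrapper below is a sketch under the assumption that the GitHub CLI (`gh`) is installed and authenticated; the function name and default limit are illustrative.

```shell
# Sketch only: list recent push-triggered runs of the docs publish workflow.
# The repository and workflow file name come from the commit message above;
# the function name and default limit are hypothetical.
recent_publish_runs() {
  gh run list \
    --repo NVIDIA/topograph \
    --workflow publish-fern-docs.yml \
    --branch main \
    --event push \
    --limit "${1:-5}"
}

# Example: recent_publish_runs 10
```

An empty list after docs changes land on `main` would match the behavior described in the commit message.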
2 participants