docs(get-started): add install-path quickstarts for Kubernetes and Slurm #292
Adds a Getting Started path that gets an operator from "I heard about Topograph" to "Topograph is installed and healthy" without requiring them to navigate into `charts/topograph/README.md` or `docs/engines/slurm.md` directly.

Two pages under `docs/get-started/`:

- `quickstart-k8s.md`: covers both the `k8s` and `slinky` engines (same Helm chart, same prerequisites, same `helm test` verification; engine-specific flags and verification in anchored subsections). The k8s+slinky merge avoids duplicating the shared install surface.
- `quickstart-slurm.md`: bare-metal Slurm install via `make deb` / `make rpm` + systemd. Separate file because the install mechanism, prerequisites, config surface, and verification are different paradigms from Helm-based K8s.

Both pages are install-focused: prerequisites, install command, verification, and "Where to go next" pointers into the existing engine references, chart README, provider docs, and API reference. They deliberately do not duplicate tutorial-depth content (demo workloads, KAI gang-scheduling examples); those belong in the Tech Blog tutorial and future integration docs.

Also adds a root README "Quick Start" section that picks between the two paths, and two new entries under the Fern nav's existing "Getting Started" section.

Signed-off-by: Rob Esker <resker@nvidia.com>
Greptile Summary

This PR adds two install-focused quickstart pages (

Confidence Score: 5/5

Safe to merge: docs-only PR with two minor P2 wording suggestions in the K8s quickstart.

All findings are P2: a slight conflation in the helm test description and a version-prerequisite note acknowledged as contingent on companion PR #291. Neither blocks users from successfully installing. Content is accurate, well-structured, and consistent with existing references.

docs/get-started/quickstart-k8s.md: minor wording improvements suggested around the helm test description and the Kubernetes 1.27 prerequisite.

Important Files Changed
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
A([Reader lands on docs / README]) --> B{Scheduler type?}
B -->|Kubernetes| C[quickstart-k8s.md]
B -->|Bare-metal Slurm| D[quickstart-slurm.md]
C --> E{Engine?}
E -->|k8s| F["helm install --set global.engine.name=k8s"]
E -->|slinky| G["helm install --set global.engine.name=slinky + ConfigMap params"]
F --> H["helm test topograph --namespace topograph"]
G --> H
H --> I["kubectl get nodes --show-labels (k8s)\nor kubectl get configmap (slinky)"]
D --> J["make deb / make rpm"]
J --> K["Edit /etc/topograph/topograph-config.yaml"]
K --> L["systemctl enable --now topograph.service"]
L --> M["curl http://localhost:49021/healthz"]
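
The last step of the bare-metal path probes `/healthz` right after the service is enabled, and a freshly started service may need a moment before it answers. A minimal sketch of a retry wrapper for that probe follows; the helper name `retry` and its arguments are hypothetical and not part of Topograph, and only the URL in the usage comment comes from the flowchart.

```shell
# retry ATTEMPTS DELAY CMD...
# Runs CMD until it succeeds or ATTEMPTS tries have been made.
# Hypothetical helper for illustration; not shipped with Topograph.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1              # give up after the last allowed attempt
    fi
    i=$((i + 1))
    sleep "$delay"          # wait before probing again
  done
}

# Example (assumes curl is installed and the service listens on the default port):
#   retry 10 2 curl -fsS http://localhost:49021/healthz
```

Passing the probe as a command keeps the helper independent of how the endpoint is reached.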
Reviews (1): Last reviewed commit: "docs(get-started): add install-path quic..."
> `helm test topograph --namespace topograph`
>
> The bundled test hooks probe `/healthz` and `/metrics` inside the cluster and expect HTTP 200 plus the `topograph_version` metric in the response body. A green result confirms the API server is running.
helm test description conflates two separate probes
The chart README (and chart source) describes two distinct test hook pods: test-healthz probes /healthz and expects HTTP 200, while test-metrics probes /metrics and expects the topograph_version metric. The current phrasing — "probe /healthz and /metrics… and expect HTTP 200 plus the topograph_version metric in the response body" — implies both probes require topograph_version, which could lead a user to incorrectly suspect a failure if /healthz doesn't emit Prometheus text.
Suggested change

- Current: The bundled test hooks probe `/healthz` and `/metrics` inside the cluster and expect HTTP 200 plus the `topograph_version` metric in the response body. A green result confirms the API server is running.
- Suggested: The bundled test hooks probe `/healthz` (expects HTTP 200) and `/metrics` (expects HTTP 200 plus the `topograph_version` metric in the response body) inside the cluster. A green result confirms the API server is running.
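
The distinction between the two probes can be sketched as two separate checks. This is an illustration of the expectations described in the comment, not the chart's actual test hooks: the function names are hypothetical, and matching the metric with a plain substring search is an assumption.

```shell
# check_healthz STATUS
# The /healthz probe only needs an HTTP 200; the body is not inspected.
check_healthz() {
  [ "$1" = "200" ]
}

# check_metrics STATUS BODY
# The /metrics probe needs an HTTP 200 and the topograph_version metric in the body.
# Assumption: a substring match on the metric name is enough for illustration.
check_metrics() {
  [ "$1" = "200" ] || return 1
  printf '%s\n' "$2" | grep -q 'topograph_version'
}

# Example: a healthy /healthz response with a non-Prometheus body still passes,
# because only check_metrics looks for the metric.
check_healthz 200 && echo "healthz ok"
```

Keeping the checks separate makes it clear that a `/healthz` body without Prometheus text is not a failure.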
> ## Prerequisites
>
> - **Kubernetes**: 1.27 or later
Kubernetes version floor contradicts existing chart README
This line declares "1.27 or later" as a hard prerequisite, but charts/topograph/README.md currently states: "no hard floor is declared by this chart; the rendered manifests use only apps/v1, rbac.authorization.k8s.io/v1, and v1, all stable since Kubernetes 1.9." The PR description acknowledges the 1.27 floor is only enforced once companion PR #291 (kubeVersion: ">=1.27.0-0") merges. Until then a reader who consults both documents gets conflicting information. Consider adding a note pointing to the chart README, or delaying this line until #291 has landed.
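
For readers who want to check a cluster against the proposed floor, a minimal sketch of a version comparison follows. The helper name `meets_floor` is hypothetical, and the 1.27 floor is the value from the quickstart under review, which the comment above notes is not enforced by the chart until #291 lands.

```shell
# meets_floor VERSION MIN_MAJOR MIN_MINOR
# Succeeds when VERSION (e.g. "v1.28.3") is at least MIN_MAJOR.MIN_MINOR.
# Hypothetical helper for illustration; expects a numeric major.minor prefix.
meets_floor() {
  v=${1#v}                   # strip an optional leading "v"
  major=${v%%.*}             # text before the first dot
  rest=${v#*.}               # text after the first dot
  minor=${rest%%.*}          # text before the next dot
  minor=${minor%%[!0-9]*}    # drop suffixes such as "+" in "27+"
  [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# Example: compare a version string against the 1.27 floor.
if meets_floor v1.26.9 1 27; then
  echo "meets floor"
else
  echo "below floor"
fi
```

The version string would typically come from the cluster itself, for example from the server version reported by `kubectl version`.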
PR #268 removed the `branches: [main]` push filter from `publish-fern-docs.yml` while keeping the `tags: [docs/v*]` filter. With GitHub Actions, defining `tags:` without a corresponding `branches:` restricts push events to tag refs only: branch pushes (including main) no longer trigger the workflow even when their changed paths match the `paths:` filter.

Symptom: the live Fern site at https://topograph.docs.buildwithfern.com/topograph has not been republished since the manual workflow_dispatch on 2026-04-20T16:51Z. PRs #284, #289, #290, #291, and #292 all touched docs/ but produced zero workflow runs. The Reference section restored by #284 (and the clique-semantics clarifications added by #289) are on main but invisible on the published site.

Restore `branches: [main]` so push events to main with docs/ or fern/ changes resume triggering publishes. Tag pushes for `docs/v*` and manual `workflow_dispatch` continue to work unchanged.

To clear the backlog after this PR merges, dispatch the workflow manually one time:

    gh workflow run publish-fern-docs.yml --repo NVIDIA/topograph --ref main

Signed-off-by: Rob Esker <resker@nvidia.com>
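
The filter behavior behind this regression can be modeled in a few lines of shell. This is a simplified sketch, not GitHub's implementation: the function name `push_triggers` is hypothetical, patterns use shell glob matching (where `*` also matches `/`, unlike GitHub's filter syntax), and the `paths:` filter is left out.

```shell
# push_triggers REF BRANCH_PATTERNS TAG_PATTERNS
# Succeeds when a push to REF would pass the branches:/tags: filters.
# An empty pattern list means that filter is not defined in the workflow.
# Simplified model for illustration; the paths: filter is not considered.
push_triggers() (
  set -f                     # keep patterns like docs/v* from expanding against the filesystem
  ref=$1 branches=$2 tags=$3
  case $ref in
    refs/heads/*) name=${ref#refs/heads/} own=$branches other=$tags ;;
    refs/tags/*)  name=${ref#refs/tags/}  own=$tags other=$branches ;;
    *) exit 1 ;;
  esac
  if [ -z "$own" ]; then
    # No filter for this ref type: it runs only if the other type is unfiltered too.
    [ -z "$other" ]
    exit $?
  fi
  for pat in $own; do
    case $name in
      $pat) exit 0 ;;
    esac
  done
  exit 1
)

# After #268: tags filter only, so a push to main is skipped.
push_triggers refs/heads/main "" "docs/v*" || echo "main push: no run"
# With branches: [main] restored, the same push triggers the workflow.
push_triggers refs/heads/main "main" "docs/v*" && echo "main push: runs"
```

The first call mirrors the broken configuration and the second mirrors the fix; tag pushes matching `docs/v*` pass in both.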
Description
Fills the Getting Started gap that a new reader hits on https://topograph.docs.buildwithfern.com/topograph: the "Getting Started" section in the Fern nav currently contains only Overview, and the root README has no Quick Start. To actually install Topograph, a reader has to navigate to `charts/topograph/README.md` (chart-adjacent, not surfaced from the docs tree) or to `docs/engines/slurm.md` (buried under Engines, not Getting Started).

This PR adds two install-focused quickstarts under a new `docs/get-started/` directory:

- `quickstart-k8s.md`: covers the `k8s` engine (native K8s scheduling via node labels) and the `slinky` engine (Slurm-on-Kubernetes via ConfigMap) in one page, since the two engines share the same Helm chart, the same prerequisites, and the same `helm test` verification. Engine-specific flags and verification are in anchored subsections so the Fern TOC exposes them.
- `quickstart-slurm.md`: bare-metal Slurm install via `make deb` / `make rpm` + systemd. Separate file because the install mechanism, prerequisites, config surface, and verification are different paradigms from Helm-based Kubernetes installs; merging would force every section into `if helm ... else rpm ...` branches.

Each page is scoped narrowly: prerequisites → install → verify → where to go next. The Getting Started experience is the on-ramp; the existing engine references (`docs/engines/k8s.md`, `docs/engines/slurm.md`, `docs/engines/slinky.md`), chart README, provider docs, and API reference remain the authoritative deeper references and are linked from the "Where to go next" section on each page. Tutorial-depth content (demo workloads, KAI Scheduler gang-scheduling examples, `podAffinity` walkthroughs) is deliberately out of scope for these pages; that belongs in the Tech Blog tutorial and in future `docs/integrations/` pages.
Also adds:
Design notes
Checklist