
docs(reference): add authoritative node labels and annotations reference #254

Merged: dmitsh merged 6 commits into NVIDIA:main from resker:docs/reference-node-labels on Apr 19, 2026

@resker
Collaborator

@resker resker commented Apr 17, 2026

Summary

Adds docs/reference/node-labels.md — the authoritative reference for every label and annotation key written by Topograph, requested in #179.

Covers:

  • The four default label keys (network.topology.nvidia.com/{accelerator,leaf,spine,core}) with topology type and semantics
  • Per-provider matrix showing which providers emit accelerator (block) vs. leaf/spine/core (tree), including value format details (IB providers produce ClusterUUID.CliqueId; NetQ uses NMX DomainUUID)
  • Relationship to nvidia.com/gpu.clique — on MNNVL systems the IB accelerator value matches; on non-MNNVL systems (B200/B300) gpu.clique is not set at all, making Topograph the only topology source
  • FNV-64a hash truncation behavior for label values >63 chars
  • Helm topologyNodeLabels configuration for customizing key prefixes
  • "Without Topograph" reference — standard Kubernetes + cloud provider + GPU Operator labels that are available by default
  • topograph.nvidia.com/ annotation keys (internal bookkeeping)
  • NVSentinel integration pattern for topology-aware fault blast-radius analysis
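The blast-radius idea in the last bullet can be illustrated with a small sketch: given a faulty node, find every node that carries the same value for one topology label. This is not NVSentinel's API. The function name `BlastRadius` and the in-memory `nodeLabels` map (standing in for labels read from the Kubernetes API) are assumptions made for the example; only the label key comes from the summary above.

```go
package main

import "sort"

// LeafLabel is the default leaf-switch label key listed in the summary.
const LeafLabel = "network.topology.nvidia.com/leaf"

// BlastRadius returns the sorted names of all nodes (including the faulty
// node itself) that share the faulty node's value for labelKey.
// nodeLabels maps node name -> label key -> label value; it is a
// hypothetical stand-in for labels fetched from the cluster.
// It returns nil when the faulty node is unknown or lacks the label.
func BlastRadius(nodeLabels map[string]map[string]string, faulty, labelKey string) []string {
	value, ok := nodeLabels[faulty][labelKey]
	if !ok || value == "" {
		return nil
	}
	var peers []string
	for name, labels := range nodeLabels {
		if labels[labelKey] == value {
			peers = append(peers, name)
		}
	}
	sort.Strings(peers) // map iteration order is random, so sort for stable output
	return peers
}
```

Because long label values are replaced by a hash, the comparison is done on the label value as written, which still groups nodes that share a switch.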

Test plan

  • Every label key verified against pkg/engines/k8s/labeler.go constants
  • Provider matrix cross-checked against each provider's topology emission
  • ClusterUUID.CliqueId format verified against pkg/providers/infiniband/common.go (nvidia-smi -q | grep "ClusterUUID\|CliqueId") and NVIDIA/k8s-device-plugin/internal/lm/imex.go
  • DomainUUID format verified against pkg/providers/netq/nmx.go
  • GPU_FABRIC_STATE_COMPLETED behavior verified against NVIDIA/go-nvlib/pkg/nvlib/device/device.go:IsFabricAttached()
  • FNV-64a behavior verified against pkg/engines/k8s/labeler.go:checkLabel

Closes #179

resker added 3 commits April 17, 2026 01:06
Documents all Kubernetes node labels and annotations set by Topograph,
including the four network.topology.nvidia.com/ labels written by the k8s
and slinky engines, the topograph.nvidia.com/ annotation keys, FNV-64a
hash truncation behavior for long values, and an NVSentinel integration
example.

Closes NVIDIA#179

Signed-off-by: Rob Esker <resker@nvidia.com>
… node-labels

- Clarify accelerator value format per provider: IB providers use
  ClusterUUID.CliqueId (same as nvidia.com/gpu.clique), NetQ uses
  NMX DomainUUID (different identifier format)
- Add note that gpu.clique is not set on non-MNNVL systems and
  Topograph is the only topology source in those environments
- Add nvidia.com/gpu.clique to the "Without Topograph" label table

Signed-off-by: Rob Esker <resker@nvidia.com>
The companion NVSentinel PR is not being filed at this time; removing
the dormant-link trailer lets the NVSentinel integration section stand
on its own as a self-contained reference.

Signed-off-by: Rob Esker <resker@nvidia.com>
@resker resker requested a review from dmitsh as a code owner April 17, 2026 06:11
@codecov

codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.46%. Comparing base (1875ab8) to head (4480ab4).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #254   +/-   ##
=======================================
  Coverage   68.46%   68.46%           
=======================================
  Files          82       82           
  Lines        4842     4842           
=======================================
  Hits         3315     3315           
  Misses       1395     1395           
  Partials      132      132           


@greptile-apps
Contributor

greptile-apps Bot commented Apr 17, 2026

Greptile Summary

Adds docs/reference/node-labels.md as the authoritative reference for every label and annotation key written by Topograph (closes #179). All annotation key constants were verified against pkg/topology/topology.go, all four default label key constants were verified against pkg/engines/k8s/labeler.go, the FNV-64a hash truncation with x-prefix was confirmed from checkLabel(), the cw provider's incompatible vertex structure is now correctly documented, and the lambdai and netq block-value formats were cross-checked against their respective provider code.

Confidence Score: 5/5

Safe to merge — documentation only, all key facts verified against source code, one minor notation ambiguity that does not affect correctness.

All P0/P1 concerns from prior review rounds have been addressed: the x-prefixed FNV hash is now documented correctly, and cw/lambdai are now included in the matrix with accurate descriptions. The single remaining finding (lambdai value format notation) is a P2 style suggestion that does not affect functional correctness.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| docs/reference/node-labels.md | New authoritative reference for all Topograph node labels and annotations; all annotation keys, label key constants, provider matrix entries, and hash-truncation behavior verified against source code. One minor notation ambiguity in the lambdai block value format. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Provider["Provider\n(aws / gcp / oci / nebius / netq /\ninfiniband-bm / infiniband-k8s /\nlambdai / cw / dra)"]
    Provider -->|GenerateTopologyConfig| Root["Vertex root"]
    Root -->|topology/block| BlockRoot["Block root\n(optional)"]
    Root -->|topology/tree| TreeRoot["Tree root\n(optional)"]

    BlockRoot --> Block["Block vertex\n(NVLink domain / CapacityBlock)"]
    Block --> NodeB["Compute node"]

    TreeRoot --> Core["Core switch\n(optional)"]
    Core --> Spine["Spine switch\n(optional)"]
    Spine --> Leaf["Leaf switch"]
    Leaf --> NodeT["Compute node"]

    Block -->|checkLabel| AccLabel["network.topology.nvidia.com/accelerator"]
    Leaf -->|checkLabel| LeafLabel["network.topology.nvidia.com/leaf"]
    Spine -->|checkLabel| SpineLabel["network.topology.nvidia.com/spine"]
    Core -->|checkLabel| CoreLabel["network.topology.nvidia.com/core"]

    checkLabel{"len > 63?"}
    AccLabel --> checkLabel
    checkLabel -->|No| RawVal["raw value"]
    checkLabel -->|Yes| Hash["x-prefixed FNV-64a hex\ne.g. x3e4f1a2b3c4d5e6f"]
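The tree half of the flowchart above can be read as: each compute node is labeled with the switches above it, nearest first. The following is a rough sketch of that reading, not the labeler's actual code. The `Vertex` type and `TreeLabels` function are hypothetical, and treating any childless vertex as a compute node is an assumption of the example.

```go
package main

// Default tree label keys from the flowchart.
const (
	LeafKey  = "network.topology.nvidia.com/leaf"
	SpineKey = "network.topology.nvidia.com/spine"
	CoreKey  = "network.topology.nvidia.com/core"
)

// Vertex is a hypothetical stand-in for a topology vertex: a switch with
// children, or a compute node with none.
type Vertex struct {
	ID       string
	Children []*Vertex
}

// TreeLabels walks the tree below root and returns, for every childless
// vertex (assumed to be a compute node), the labels naming its nearest
// ancestors: parent as leaf, grandparent as spine, great-grandparent as core.
// Missing tiers are simply omitted, matching the "optional" boxes in the chart.
func TreeLabels(root *Vertex) map[string]map[string]string {
	keys := []string{LeafKey, SpineKey, CoreKey}
	out := make(map[string]map[string]string)

	var walk func(v *Vertex, ancestors []string)
	walk = func(v *Vertex, ancestors []string) {
		if len(v.Children) == 0 {
			labels := make(map[string]string)
			// ancestors is ordered top-down, so read it from the end.
			for i := 0; i < len(keys) && i < len(ancestors); i++ {
				labels[keys[i]] = ancestors[len(ancestors)-1-i]
			}
			out[v.ID] = labels
			return
		}
		// Copy before extending so sibling subtrees never share a backing array.
		path := append(append([]string(nil), ancestors...), v.ID)
		for _, c := range v.Children {
			walk(c, path)
		}
	}

	for _, top := range root.Children {
		walk(top, nil)
	}
	return out
}
```

In this sketch the raw switch IDs are used as values; in the flowchart each value additionally passes through checkLabel before it is written.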

Reviews (4): Last reviewed commit: "docs(reference): fix broken engine links..."

Comment thread docs/reference/node-labels.md Outdated

### Label value behavior

Label values are used as-is when they are 63 characters or shorter (the Kubernetes label value limit). Values longer than 63 characters are replaced with their **FNV-64a hash** (hex-encoded) to stay within the limit. This means two nodes with the same long switch identifier will carry the same hash value — locality is preserved, but the original identifier is not recoverable from the label alone.

P2 FNV-64a hash value includes an x prefix

The description says "hex-encoded," but checkLabel() in pkg/engines/k8s/labeler.go formats the value as fmt.Sprintf("x%x", h.Sum64()) — the resulting label value is a lowercase hex string with a leading x (e.g., x3e4f1a2b3c4d5e6f), not a bare hex string. Operators who grep for or compare label values need to know this prefix is present.

Suggested change
Label values are used as-is when they are 63 characters or shorter (the Kubernetes label value limit). Values longer than 63 characters are replaced with their **FNV-64a hash** (hex-encoded) to stay within the limit. This means two nodes with the same long switch identifier will carry the same hash value — locality is preserved, but the original identifier is not recoverable from the label alone.
Label values are used as-is when they are 63 characters or shorter (the Kubernetes label value limit). Values longer than 63 characters are replaced with their **FNV-64a hash** (formatted as `x` followed by lowercase hex, e.g., `x3e4f1a2b3c4d5e6f`) to stay within the limit. This means two nodes with the same long switch identifier will carry the same hash value — locality is preserved, but the original identifier is not recoverable from the label alone.
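The truncation rule discussed in this thread can be sketched in a few lines of Go. This is a minimal illustration, not a copy of checkLabel: the function name `LabelValue` is hypothetical, and hashing the raw bytes of the value is an assumption. Only the 63-character limit and the `fmt.Sprintf("x%x", h.Sum64())` formatting are taken from the comment above.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// maxLabelValueLen is the Kubernetes limit on label value length.
const maxLabelValueLen = 63

// LabelValue returns raw unchanged when it fits in a label value.
// Longer values are replaced by "x" followed by the lowercase hex
// FNV-64a hash of the value (assumed here to be computed over its bytes).
func LabelValue(raw string) string {
	if len(raw) <= maxLabelValueLen {
		return raw
	}
	h := fnv.New64a()
	h.Write([]byte(raw)) // Write on a hash.Hash never returns an error
	// %x does not zero-pad, so the result is "x" plus at most 16 hex digits.
	return fmt.Sprintf("x%x", h.Sum64())
}
```

Two nodes behind the same long switch identifier get the same hashed value, so equality comparisons on the label still work; recovering the original identifier requires a separate lookup.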

The hash is rendered as fmt.Sprintf("x%x", h.Sum64()) in
pkg/engines/k8s/labeler.go:checkLabel, producing an x-prefixed lowercase
hex string. Previous wording said 'hex-encoded' without the prefix,
which matters for operators parsing or filtering label values.

Signed-off-by: Rob Esker <resker@nvidia.com>
Comment on lines +23 to +32
|---|---|---|
| `aws` | Yes (CapacityBlockId) | Yes |
| `gcp` | No | Yes |
| `oci` | No | Yes |
| `nebius` | No | Yes |
| `netq` | Yes (NMX `DomainUUID`) | Yes (Spectrum-X switch hierarchy) |
| `dra` | Yes (reads `nvidia.com/gpu.clique`) | No |
| `infiniband-bm` | Yes (`ClusterUUID.CliqueId`) | Yes (IB switch hierarchy) |
| `infiniband-k8s` | Yes (`ClusterUUID.CliqueId`) | Yes (IB switch hierarchy) |


P1 Provider matrix omits lambdai and cw

Two registered providers are absent from the matrix even though both write topology labels. lambdai (pkg/providers/lambdai/instance_topology.go) sets AcceleratorID = NVLink.DomainID + "." + NVLink.CliqueID and passes it through ToThreeTierGraph, so it emits both block (accelerator) and tree (leaf/spine/core) labels. cw (pkg/providers/cw/provider.go) calls ib.GenerateTopologyConfig and wraps the output as a bare tree root (no toGraph() call), so it emits tree-only labels.

Suggested additions to the matrix:

Suggested change
|---|---|---|
| `aws` | Yes (CapacityBlockId) | Yes |
| `gcp` | No | Yes |
| `oci` | No | Yes |
| `nebius` | No | Yes |
| `netq` | Yes (NMX `DomainUUID`) | Yes (Spectrum-X switch hierarchy) |
| `dra` | Yes (reads `nvidia.com/gpu.clique`) | No |
| `infiniband-bm` | Yes (`ClusterUUID.CliqueId`) | Yes (IB switch hierarchy) |
| `infiniband-k8s` | Yes (`ClusterUUID.CliqueId`) | Yes (IB switch hierarchy) |
| `aws` | Yes (CapacityBlockId) | Yes |
| `cw` | No | Yes (IB switch hierarchy) |
| `gcp` | No | Yes |
| `lambdai` | Yes (`NVLink.DomainID.CliqueID`) | Yes |
| `oci` | No | Yes |
| `nebius` | No | Yes |
| `netq` | Yes (NMX `DomainUUID`) | Yes (Spectrum-X switch hierarchy) |
| `dra` | Yes (reads `nvidia.com/gpu.clique`) | No |
| `infiniband-bm` | Yes (`ClusterUUID.CliqueId`) | Yes (IB switch hierarchy) |
| `infiniband-k8s` | Yes (`ClusterUUID.CliqueId`) | Yes (IB switch hierarchy) |

resker added a commit to resker/topograph that referenced this pull request Apr 17, 2026
Folds high-value context from the unpublished blog draft into the
repository docs so the blog can reference canonical sources rather
than being the sole host of this material.

- README.md: expand "Motivation and Problem Statement" with the
  disaggregated inference scenario — prefill/decode separation,
  KV cache transfer, NVLink vs. Ethernet performance cliff
- docs/providers/infiniband.md: add a "why automate IB discovery"
  paragraph with the scale argument (hand-maintained labels work at
  32 nodes, break at 1,000)
- docs/providers/netq.md: add "Observed vs. Intended Topology"
  section describing NetQ's distinctive ability to observe
  degraded-but-up links via live telemetry
- CONTRIBUTING.md: add "Community" section with Kubernetes Slack
  channels (#topology-aware-scheduling, #gpu-nvidia) and "Areas
  where contributions are especially welcome" covering new
  providers, on-prem fabrics, KEP-5732 Workload API, KEP-4962
  upstream label schema, and Grove ClusterTopology integration

The KEP-4962 note on docs/reference/node-labels.md is intentionally
deferred to a follow-up PR once NVIDIA#254 lands, since node-labels.md
does not yet exist on main.

Signed-off-by: Rob Esker <resker@nvidia.com>
resker added a commit to resker/topograph that referenced this pull request Apr 17, 2026
Folds high-value context from the unpublished blog draft into the
repository docs so the blog can reference canonical sources rather
than being the sole host of this material.

- README.md: expand "Motivation and Problem Statement" with the
  disaggregated inference scenario — prefill/decode separation,
  KV cache transfer, NVLink vs. Ethernet performance cliff
- docs/providers/infiniband.md: add a "why automate IB discovery"
  paragraph with the scale argument (hand-maintained labels work at
  32 nodes, break at 1,000)
- docs/providers/netq.md: add "Observed vs. Intended Topology"
  section describing NetQ's distinctive ability to observe
  degraded-but-up links via live telemetry
- CONTRIBUTING.md: add "Community" section with Kubernetes Slack
  channels (#topology-aware-scheduling, #gpu-nvidia) and a pointer
  to the pinned roadmap/focus-areas issue

The KEP-4962 note on docs/reference/node-labels.md is intentionally
deferred to a follow-up PR once NVIDIA#254 lands, since node-labels.md
does not yet exist on main. Forward-looking contribution areas are
captured as a pinned GitHub issue rather than inline in
CONTRIBUTING.md, per the convention that CONTRIBUTING.md describes
how to contribute while roadmap/direction lives in a pinned issue.

Signed-off-by: Rob Esker <resker@nvidia.com>
dmitsh pushed a commit that referenced this pull request Apr 17, 2026
…264)

Folds high-value context from the unpublished blog draft into the
repository docs so the blog can reference canonical sources rather
than being the sole host of this material.

- README.md: expand "Motivation and Problem Statement" with the
  disaggregated inference scenario — prefill/decode separation,
  KV cache transfer, NVLink vs. Ethernet performance cliff
- docs/providers/infiniband.md: add a "why automate IB discovery"
  paragraph with the scale argument (hand-maintained labels work at
  32 nodes, break at 1,000)
- docs/providers/netq.md: add "Observed vs. Intended Topology"
  section describing NetQ's distinctive ability to observe
  degraded-but-up links via live telemetry
- CONTRIBUTING.md: add "Community" section with Kubernetes Slack
  channels (#topology-aware-scheduling, #gpu-nvidia) and a pointer
  to the pinned roadmap/focus-areas issue

The KEP-4962 note on docs/reference/node-labels.md is intentionally
deferred to a follow-up PR once #254 lands, since node-labels.md
does not yet exist on main. Forward-looking contribution areas are
captured as a pinned GitHub issue rather than inline in
CONTRIBUTING.md, per the convention that CONTRIBUTING.md describes
how to contribute while roadmap/direction lives in a pinned issue.

Signed-off-by: Rob Esker <resker@nvidia.com>
Addresses Greptile's P1 finding on NVIDIA#254. Both are registered providers
that write topology labels but were missing from the matrix:

- lambdai: emits ClusterUUID.CliqueID-style accelerator via
  NVLink.DomainID + NVLink.CliqueID, plus tree labels via
  ToThreeTierGraph (pkg/providers/lambdai/instance_topology.go)
- cw: calls ib.GenerateTopologyConfig and wraps the output as a bare
  tree root (pkg/providers/cw/provider.go), emitting tree labels only

Signed-off-by: Rob Esker <resker@nvidia.com>
Comment thread docs/reference/node-labels.md Outdated
| Provider | Block (`accelerator`) | Tree (`leaf`/`spine`/`core`) |
|---|---|---|
| `aws` | Yes (CapacityBlockId) | Yes |
| `cw` | No | Yes (IB switch hierarchy) |

P1 cw tree column is incorrect — no labels are emitted

cw.GenerateTopologyConfig returns a raw treeRoot vertex where children are keyed by switch IDs, not wrapped under the "topology/tree" key. Both the k8s labeler (ApplyNodeLabels, labeler.go:77) and the Slinky engine's initTree (translate/topology.go:118) gate all tree processing on root.Vertices[topology.TopologyTree] — a key that is never present in the cw output. As a result, the cw provider produces no node labels at all through either engine.

Compare cw's return value:

// cw/provider.go — switch IDs as keys, no "topology/tree" wrapper
treeRoot := &topology.Vertex{Vertices: make(map[string]*topology.Vertex)}
for _, v := range roots {
    treeRoot.Vertices[v.ID] = v  // keys are switch identifiers
}
return treeRoot, nil

With netq's correct structure:

// netq/provider.go — properly wrapped
root.Vertices[topology.TopologyTree] = treeRoot

The tree column for cw should be "No" until the provider is updated to wrap its output, or the code should be fixed and the doc updated to "Yes" once the provider emits the expected structure.

Suggested change
| `cw` | No | Yes (IB switch hierarchy) |
| `cw` | No | No (vertex structure incompatible with labeler — see pkg/providers/cw/provider.go) |
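The gating behavior described in this comment can be reproduced with a simplified model. The `Vertex` type below is a cut-down, hypothetical version of the real one, and `TreeRoot` and `WrapTree` are illustrative names rather than functions from the repository. The only detail taken from the review is that tree processing is keyed on a "topology/tree" entry under the root.

```go
package main

// TopologyTree is the root-level key that both engines look up before
// processing tree topology, per the review comment above.
const TopologyTree = "topology/tree"

// Vertex is a simplified, hypothetical model of a topology vertex.
type Vertex struct {
	ID       string
	Vertices map[string]*Vertex
}

// TreeRoot mimics the gate: it returns the tree root only when the
// provider output wraps it under the TopologyTree key.
func TreeRoot(root *Vertex) (*Vertex, bool) {
	if root == nil {
		return nil, false
	}
	tree, ok := root.Vertices[TopologyTree] // indexing a nil map is safe
	return tree, ok && tree != nil
}

// WrapTree shows the wrapping the comment says netq performs and cw omits.
func WrapTree(treeRoot *Vertex) *Vertex {
	return &Vertex{Vertices: map[string]*Vertex{TopologyTree: treeRoot}}
}
```

A root whose children are keyed directly by switch ID fails the lookup in `TreeRoot`, which is why no tree labels would be written in that case; passing the same vertex through `WrapTree` first makes the lookup succeed.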

Two fixes from the PR NVIDIA#254 review:

- Fern Check was failing because line 7 points at `../engines.md`,
  which is not a single file — `docs/engines/` is a directory with
  per-engine files. Fixed the Kubernetes and Slinky engine links to
  point at `../engines/k8s.md` and `../engines/slinky.md`
  respectively.

- Greptile's latest P1 finding is correct: `cw.GenerateTopologyConfig`
  returns a tree root whose children are switch IDs directly, not
  wrapped under `topology.TopologyTree`. Both the k8s labeler
  (`ApplyNodeLabels`, `labeler.go:77`) and Slinky's `initTree`
  (`translate/topology.go:118`) gate processing on
  `root.Vertices[topology.TopologyTree]`, which is never present in
  cw's output. Compare with `netq/provider.go:92` which correctly
  wraps `root.Vertices[topology.TopologyTree] = treeRoot`. The cw
  provider emits zero labels today; updated the matrix row to
  reflect current behavior with a note about the underlying issue.

Signed-off-by: Rob Esker <resker@nvidia.com>
@dmitsh dmitsh merged commit e3fd2b3 into NVIDIA:main Apr 19, 2026
3 checks passed
resker added a commit to resker/topograph that referenced this pull request Apr 19, 2026
Adds a short subsection under `## Labels` covering KEP-4962
("Standardizing the Representation of Cluster Network Topology"),
which is pre-GA and still under upstream review at
kubernetes/enhancements#4962 (draft PR #4965). Notes that the KEP's
framing allows vendor prefixes like `network.topology.nvidia.com/*`
to coexist with the standard `topology.kubernetes.io/` keys rather
than replace them, and that Topograph will evaluate aligning or
providing both if and when the KEP reaches GA. For now, the
`network.topology.nvidia.com/*` keys remain authoritative for
Topograph-deployed clusters.

This note was deferred from PR NVIDIA#264 because the target file
(`docs/reference/node-labels.md`) did not yet exist on `main`; it
was introduced by PR NVIDIA#254, which merged 2026-04-19.

Signed-off-by: Rob Esker <resker@nvidia.com>
resker added a commit to resker/topograph that referenced this pull request Apr 19, 2026
Adds back two rows to the Documentation Impact Evaluation table in
`AGENTS.md` and `.claude/CLAUDE.md` that were removed from PR NVIDIA#269
to avoid cross-referencing content not yet on `main`:

- Chart template row pointing at `docs/engines/k8s.md` "Exposing the
  Topograph API" section (added to main by PR NVIDIA#259)
- Label or annotation key row pointing at `docs/reference/node-labels.md`
  (added to main by PR NVIDIA#254)

Both gating PRs have now merged, so the rows can be restored without
broken cross-references. Paired edit preserves the byte-identical
invariant between `AGENTS.md` and `.claude/CLAUDE.md` from line 6
onward (verified with `cmp`).

Signed-off-by: Rob Esker <resker@nvidia.com>
dmitsh pushed a commit that referenced this pull request Apr 19, 2026
)

* docs(reference): add KEP-4962 upstream-alignment note to node-labels.md

Adds a short subsection under `## Labels` covering KEP-4962
("Standardizing the Representation of Cluster Network Topology"),
which is pre-GA and still under upstream review at
kubernetes/enhancements#4962 (draft PR #4965). Notes that the KEP's
framing allows vendor prefixes like `network.topology.nvidia.com/*`
to coexist with the standard `topology.kubernetes.io/` keys rather
than replace them, and that Topograph will evaluate aligning or
providing both if and when the KEP reaches GA. For now, the
`network.topology.nvidia.com/*` keys remain authoritative for
Topograph-deployed clusters.

This note was deferred from PR #264 because the target file
(`docs/reference/node-labels.md`) did not yet exist on `main`; it
was introduced by PR #254, which merged 2026-04-19.

Signed-off-by: Rob Esker <resker@nvidia.com>

* docs(agents): restore two Doc-Impact Evaluation table rows

Adds back two rows to the Documentation Impact Evaluation table in
`AGENTS.md` and `.claude/CLAUDE.md` that were removed from PR #269
to avoid cross-referencing content not yet on `main`:

- Chart template row pointing at `docs/engines/k8s.md` "Exposing the
  Topograph API" section (added to main by PR #259)
- Label or annotation key row pointing at `docs/reference/node-labels.md`
  (added to main by PR #254)

Both gating PRs have now merged, so the rows can be restored without
broken cross-references. Paired edit preserves the byte-identical
invariant between `AGENTS.md` and `.claude/CLAUDE.md` from line 6
onward (verified with `cmp`).

Signed-off-by: Rob Esker <resker@nvidia.com>

---------

Signed-off-by: Rob Esker <resker@nvidia.com>
dmitsh pushed a commit that referenced this pull request Apr 22, 2026
…284)

* docs(fern): restore Reference section to nav

`docs/reference/node-labels.md` has been on main since PR #254 merged
(2026-04-19) but is not listed in `docs/index.yml`, the Fern nav
source-of-truth. The last successful Fern publish (2026-04-20T16:52Z)
resolved 13 pages — exactly the count declared in the nav without a
Reference section — confirming the page is invisible on the live site
at https://topograph.docs.buildwithfern.com/topograph despite being on
the filesystem.

Add a Reference section listing `reference/node-labels.md`. This is the
paired-file update that PR #254 should have included.

Verified filesystem vs. nav discrepancy:

| docs/ file                       | In index.yml? |
|----------------------------------|---------------|
| overview.md                      | yes           |
| architecture.md + api.md         | yes           |
| providers/{aws,gcp,oci,nebius,...| yes (7)       |
| engines/{slurm,k8s,slinky}.md    | yes (3)       |
| reference/node-labels.md         | NO (fixed)    |

Signed-off-by: Rob Esker <resker@nvidia.com>

* docs(fern): label current version as v0.3.0

The published Fern site shows a version label of "dev" because
`fern/docs.yml` declared the only version as `display-name: dev`. The
repo is at tagged release v0.3.0 with no divergence between the released
content and the in-flight docs, so "dev" is misleading: readers see
"development docs" framing when what they're actually looking at
corresponds to the v0.3.0 tag.

Rename the version to `v0.3.0`. No content split yet — both versions
would point at the same `docs/index.yml` anyway. When content actually
diverges (post-v0.4.0 breaking doc changes, or a doc change that only
applies to the future release), re-introduce a separate `dev` entry
alongside v0.3.0 and repoint paths to differentiated content.

Drop the pre-existing `/topograph/dev/index.html` -> `/topograph/dev/`
redirect because the `/dev/` destination no longer exists after the
rename. Not adding `/dev` -> new-version forwarding: site has only
been live since 2026-04-17 (four days), so dev-URL bookmarks are
unlikely; the Fern root redirect + sidebar navigation recovers any
user who lands on a 404. If dead-bookmark complaints surface later,
adding a one-line redirect is trivial.

Verified: `fern check` reports 0 errors (1 pre-existing warning about
the NVIDIA-green accent-color contrast ratio, unrelated to this PR).
`fern docs dev` renders the sidebar with all five sections intact
(Getting Started, Architecture, Providers, Engines, Reference) and
the version label reads "v0.3.0".

Signed-off-by: Rob Esker <resker@nvidia.com>

* chore(ci): surface fern errors in publish-fern-docs workflow

The publish step used `OUTPUT=$(fern generate --docs 2>&1)` to capture
fern's output so the script could grep for the "Published docs to" URL
and post it to the GitHub Actions step summary. Side effect: on
non-zero exit from `fern generate`, `bash -e` aborted the step BEFORE
the subsequent `echo "$OUTPUT"` ran, so fern's error output never
reached the Actions log. All three failed publishes between 2026-04-17
and 2026-04-20 (plus the 2026-04-18 manual dispatches) logged only
"Process completed with exit code 1" with no diagnostic context.

Switch to `fern generate ... 2>&1 | tee /tmp/fern-output.log`. The
`tee` streams fern's stdout/stderr to the Actions log in real time
(visible in both success and failure cases) and mirrors to a file that
the URL-extraction grep reads after. `set -o pipefail` is added so the
step still fails on fern's non-zero exit (tee's own exit status would
otherwise mask it). The `|| true` on grep ensures a missing URL in
fern's output does not fail the step on its own.

Net effect: future publish failures are diagnosable from the Actions
log directly. No behavior change on the success path.

No ability to test the failure path locally without reproducing a
deliberate fern error; validated only that the happy-path shell
snippet is syntactically valid. Future failures will surface the
actual fern error, which is the whole point.

Signed-off-by: Rob Esker <resker@nvidia.com>

---------

Signed-off-by: Rob Esker <resker@nvidia.com>

Development

Successfully merging this pull request may close these issues.

Question: What is the authoritative set of node labels we can look at for determining topology in Kubernetes cluster.
