
Add NVLink #469

Open
danbedford wants to merge 32 commits into Syllo:master from danbedford:nvlink

Conversation

@danbedford (Contributor) commented May 3, 2026

NVTop NVLink Fork - Changelog

Upstream: Syllo/nvtop (commit 095d91c "Remove unused function in ixml")
Fork: danbedford/nvtop, branch nvlink
GPU Tested: NVIDIA GeForce RTX 3090
Scope: 5 files changed, 706 insertions(+), 19 deletions(-)


Overview

Extends nvtop with per-GPU NVLink info in unused space of the existing interface. When no NVLink-supported GPU is detected, layout and behavior are identical to upstream -- no visual or functional difference. The goal is to bring useful NVLink status and throughput data to all nvtop users with NVLink-capable hardware, from consumer cards (2080 and 3090 series) to datacenter parts (Ampere, Hopper, Blackwell series).

NVLink Supported Device Example: (screenshot: NVLSupported)

NVLink Connected Device Example: (screenshot: NVLConnected)

Main bar (row 2, shown by default)

Appended at end after power_info -- NVLink version, link count, and aggregate throughput displayed. Two display states:

NVLink supported device - No bridge or no active links (0-link case, no row 2 padding compaction applied):

NVL5 0x

With active links (row 2 padding compaction applied, throughput displayed). Example (theoretical fully saturated GB200 with NVLink 5.0):

NVL518x 1.636 TiB/s

When NVLink is supported but no bridge is connected or links are inactive, only the version and link count display -- no compaction is applied to reclaim space on row 2 since there is no throughput to display. The NVL5 0x text extends past the panel edge without affecting the layout. Only when active links exist does fan field compaction kick in (11 to 8 characters) to make room for the throughput value.
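The compaction rule above boils down to a single width selection. A minimal sketch (`fan_field_width` is a hypothetical helper for illustration; the flag name is the one used throughout this changelog):

```c
#include <stdbool.h>

/* Fan field width per the compaction rule described above: padding is only
 * reclaimed when at least one monitored GPU has active NVLink links. */
static unsigned fan_field_width(bool any_device_has_nvlink_active) {
  return any_device_has_nvlink_active ? 8 : 11; /* 3 chars freed for throughput */
}
```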

  • Label: NVL, a minimal label for NVLink.
  • Version: Marketing NVLink version via NVIDIA Management Library (NVML) nvmlDeviceGetNvLinkVersion (raw NVML enum values require remapping):
    • Raw 1 -> NVLink V1.0 -> Display 1
    • Raw 2 -> V2.0 -> 2
    • Raw 3 -> V2.2 -> 2
    • Raw 4 -> V3.0 -> 3
    • Raw 5 -> V3.1 -> 3
    • Raw 6 -> V4.0 -> 4
    • Raw 7 -> V5.0 -> 5
    • Raw 8 -> V6.0 -> 6 (assumed Rubin)
      Display shows single-digit major version due to limited space.
  • Link count: Total physical links on the device (static hardware property). Maximum is 36 to future-proof for the planned NVIDIA Rubin platform.
  • Throughput: Aggregate Transmit plus Receive utilization, currently read via nvidia-smi CLI fallback for all NVLink-connected GPUs. This carries measurable overhead from forking a full binary and parsing its text output, but providing real throughput visibility to consumer GPU users outweighs the cost, and users without NVLink hardware are unaffected since the CLI path never runs for them. The 2-second interval is hardcoded and independent of global nvtop refresh rate to cap CLI calls regardless of display speed. Uses "r" (raw) counters which include payload plus protocol overhead, reflecting true bandwidth utilization. Parses "Link N: Raw Tx: NNNNN KiB" / "Raw Rx" per link. Delta = (current - previous) / time_delta per link, summed for aggregate; unsigned underflow guard checks new >= old before subtraction. No smoothing applied -- raw accuracy over display smoothness. TODO: On datacenter GPUs with nvmlDeviceGetNvLinkUtilizationCounter, replace with direct API call; keep CLI fallback for consumer GPUs.
  • Layout compaction: The Fan field shrinks from 11 to 8 characters ONLY when any_device_has_nvlink_active is true (at least one monitored GPU has active NVLink links). GPUs with NVLink hardware but no bridge (0-link case) do NOT get compaction -- NVL3 0x extends past the panel edge without needing reclaimed space. Panel width is determined by device name length (device_name column = largest_device_name + 11), so longer names produce more room for NVLink.
  • Throughput display: Uses print_data_at_scale() (renamed from print_pcie_at_scale()) with IEC binary prefixes. Array bounds check extended from < 5 to < 6 to support up to tebibytes per second (TiB/s) for Blackwell NVLink 5.0 devices at ~1.636 TiB/s aggregate. The memory_prefix[] array already contains entries up to "Pi" -- only the loop guard needed updating.
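The raw-to-marketing version remap above can be sketched as follows. The function name nvlink_marketing_version appears in the commit log; the body is an illustrative reconstruction of the mapping table:

```c
/* Map raw NVML NvLink version enum values to the single-digit marketing
 * major version displayed in the panel, per the table in this changelog. */
static unsigned nvlink_marketing_version(unsigned raw) {
  switch (raw) {
  case 1: return 1;  /* NVLink 1.0 */
  case 2:            /* NVLink 2.0 */
  case 3: return 2;  /* NVLink 2.2 */
  case 4:            /* NVLink 3.0 */
  case 5: return 3;  /* NVLink 3.1 */
  case 6: return 4;  /* NVLink 4.0 */
  case 7: return 5;  /* NVLink 5.0 */
  case 8: return 6;  /* NVLink 6.0 (assumed Rubin) */
  default: return 0; /* unknown or unsupported */
  }
}
```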

Extra GPU info bar (row 4, not shown by default)

Appended at end after exec_engines -- error and correction counters since nvtop launch. Example with zeroed counters:

NVL E:00000 C:00000

Example with non-zero counters (errors in red, corrections in yellow):

NVL E:00420 C:00069
  • Label: NVL, a minimal label for NVLink.
  • Error counters: Replay, recovery, CRC FLIT, and CRC DATA errors via nvmlDeviceGetNvLinkErrorCounter, summed across all links. Baseline subtraction ensures counters start at zero on nvtop launch.
  • CRC corrections: Per-lane CRC flit corrections via nvmlDeviceGetFieldValues (field IDs 32-247 for links 0-35), summed across all links. Uses modern signature (device, valuesCount, fieldValues) with field IDs populated in-place in the nvmlFieldValue_t buffer (48 bytes on NVML 11.515+: fieldId at offset 0, scopeId at 4, timestamp at 8, latencyUsec at 16, valueType at 24, nvmlReturn at 28, value.union at 32). Offsets are manually parsed since nvml.h is not exposed in the nvtop build.
  • Error counters read during refresh cycle (gpuinfo_nvidia_refresh_dynamic_info()), not during startup probe (nvtop_probe_nvlink_list() calls nvtop_get_nvlink_info() before display is drawn). This ensures the baseline is established at the moment of first display refresh, guaranteeing counters read zero on launch. nvtop_get_nvlink_info() does NOT read error counters in the display path.
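The manual offset parsing can be sketched as below, assuming the 48-byte nvmlFieldValue_t layout stated above (fieldId at 0, nvmlReturn at 28, value union at 32). The macro and helper names here are illustrative, not from nvml.h:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative offsets for the 48-byte nvmlFieldValue_t (NVML 11.515+)
 * as described in this changelog; names are not NVML's. */
#define FV_SIZE        48
#define FV_FIELDID_OFF  0
#define FV_RETURN_OFF  28
#define FV_VALUE_OFF   32

/* Populate the field ID in-place before calling GetFieldValues. */
static void fv_set_field_id(unsigned char *buf, unsigned idx, uint32_t id) {
  memcpy(buf + idx * FV_SIZE + FV_FIELDID_OFF, &id, sizeof id);
}

/* Read the 64-bit value back, only if the per-field nvmlReturn is 0. */
static int fv_get_ull(const unsigned char *buf, unsigned idx, uint64_t *out) {
  int32_t ret;
  memcpy(&ret, buf + idx * FV_SIZE + FV_RETURN_OFF, sizeof ret);
  if (ret != 0)
    return 0; /* field read failed; skip this link */
  memcpy(out, buf + idx * FV_SIZE + FV_VALUE_OFF, sizeof *out);
  return 1;
}
```

memcpy is used instead of pointer casts to stay free of alignment assumptions on the raw buffer.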

Files Changed

include/nvtop/extract_gpuinfo_common.h (+31 lines, -1 line)

  • NVTOP_NVLINK_MAX_LINKS defined to 36
  • Flat struct nvlink_info: num_links, version, supported, has_throughput, aggregate_tx, aggregate_rx, total_errors, total_corrections
  • nvtop_get_nvlink_info(): return cached NVLink data; vendor guard skips non-NVIDIA GPUs before container_of()
  • nvtop_get_nvlink_error_counts(): public getter for display-ready error/correction counts; bridges interface.c to per-device error state in extract_gpuinfo_nvidia.c
  • nvtop_probe_nvlink_list(): probe all devices for NVLink support before curses init; short-circuits if any_device_has_nvlink already true
  • nvtop_set_nvlink_probe(): set any_device_has_nvlink global flag only (leaves any_device_has_nvlink_active untouched)
  • nvtop_reset_nvlink_cache(): reset all per-device NVLink caching (probe flag, cached linkcount, cached version, cached info struct) on monitored GPU set change; vendor guard for non-NVIDIA
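The flat struct might look roughly like this. Field names are from the list above; the exact types beyond the unsigned long long cumulative counters (stated under Design Decisions) are assumptions:

```c
#include <stdbool.h>

#define NVTOP_NVLINK_MAX_LINKS 36 /* future-proofed for Rubin */

/* Flat per-device NVLink info: no per-link arrays, no dynamic allocation
 * in the hot refresh path. */
struct nvlink_info {
  unsigned num_links;                   /* active links (0 if no bridge) */
  unsigned version;                     /* marketing major version, 1..6 */
  bool supported;                       /* device exposes NVLink at all */
  bool has_throughput;                  /* CLI throughput data available */
  unsigned long long aggregate_tx;      /* bytes/s summed across links */
  unsigned long long aggregate_rx;      /* bytes/s summed across links */
  unsigned long long total_errors;      /* since launch, baseline-adjusted */
  unsigned long long total_corrections; /* since launch, baseline-adjusted */
};
```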

include/nvtop/interface_internal_common.h (+4 lines, -1 line)

  • WINDOW *nvlink_info added to struct device_window (row 2 throughput)
  • WINDOW *nvlink_errors added to struct device_window (row 4 errors)
  • device_nvlink_errors added to enum device_field with size 19

src/extract_gpuinfo_nvidia.c (+451 insertions, -1 deletion)

  • Four NVML symbols via dlsym(): nvmlDeviceGetNvLinkState, nvmlDeviceGetNvLinkVersion, nvmlDeviceGetNvLinkErrorCounter, nvmlDeviceGetFieldValues (modern 3-param signature)
  • Per-device state: device_index, cli_poll_active, per-link CLI counters, baseline/display error fields, probe cache (nvlink_probed, nvlink_cached_linkcount, nvlink_cached_version), full struct cache (cached_nvlink_info, cached_nvlink_info_populated)
  • Link discovery: probes links 0-35 via nvmlDeviceGetNvLinkState, counts consecutive successes, stops on first hard error or NVML_ERROR_NOT_SUPPORTED; only active links (isActive == 1) are counted -- physical slots with no bridge are excluded
  • Caching: 3 layers -- (1) link count/version via nvlink_probe_and_cache(), (2) full struct via nvlink_refresh_cached_info(), (3) list-level probe short-circuit in nvtop_probe_nvlink_list(); all reset by nvtop_reset_nvlink_cache() on GPU set change
  • Throughput: nvidia-smi nvlink --getthroughput r -i <dev> every 2 seconds (hardcoded, independent of display refresh rate), delta-based rate computation with unsigned underflow guard
  • Error reading via nvlink_read_errors(): called from gpuinfo_nvidia_refresh_dynamic_info() (not nvtop_get_nvlink_info()) to ensure baseline is established at first display refresh; reads errors via nvmlDeviceGetNvLinkErrorCounter and corrections via nvmlDeviceGetFieldValues; unsigned underflow guard prevents counter wrap artifacts
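The CLI fallback's parse-and-delta step can be sketched as below. Function names are illustrative; the "Raw Tx" line format and the unsigned-underflow guard follow the description above:

```c
#include <stdint.h>
#include <stdio.h>

/* Parse one "Link N: Raw Tx: NNNNN KiB" line from nvidia-smi nvlink output.
 * Returns 1 and fills link/kib on a match, 0 otherwise. */
static int parse_raw_tx_line(const char *line, unsigned *link, uint64_t *kib) {
  unsigned long long v;
  if (sscanf(line, " Link %u: Raw Tx: %llu KiB", link, &v) != 2)
    return 0;
  *kib = v;
  return 1;
}

/* Bytes/s from two cumulative KiB samples; the new >= old guard skips the
 * sample entirely if the hardware counter wrapped or reset. */
static uint64_t rate_bytes_per_sec(uint64_t new_kib, uint64_t old_kib,
                                   double dt_sec) {
  if (new_kib < old_kib || dt_sec <= 0.0)
    return 0;
  return (uint64_t)((double)(new_kib - old_kib) * 1024.0 / dt_sec);
}
```

Per-link rates computed this way are then summed for the aggregate figure shown on row 2.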

src/interface.c (+220 insertions, -20 deletions)

  • Conditional layout: any_device_has_nvlink controls window allocation; any_device_has_nvlink_active controls fan compaction (shrinks from 11 to 8 chars only when active links exist -- 0-link devices do not get compaction)
  • device_length() always uses base layout (clock + mem_clock + temp + fan + power + 5) regardless of NVLink state; NVLink window on line 2 extends past nominal panel edge, which ncurses handles gracefully
  • nvtop_adjust_field_sizes_for_nvlink() checks any_device_has_nvlink_active (not any_device_has_nvlink) for fan compaction
  • interface_check_monitored_gpu_change() resets ALL mutable NVLink state: both global flags plus sizeof_device_field[device_fan_speed] = 11, then calls per-device nvtop_reset_nvlink_cache()
  • Fan N/A fallback branch uses any_device_has_nvlink_active for correct 11-character format on 0-link devices
  • NVLink info window (row 2): displays print_data_at_scale()-formatted throughput (renamed from print_pcie_at_scale(); bounds check extended to < 6 for TiB/s ceiling)
  • NVLink errors window (row 4): reads via nvtop_get_nvlink_error_counts() (does NOT call nvtop_get_nvlink_info() in display path)
  • Memory leak fixes: added missing delwin() for shader_cores, l2_cache_size, exec_engines, plots[i].plot_window, and nvlink_errors. Two of these are also submitted as standalone upstream PRs: the free_device_windows() fix (PR #467, "fix: add missing delwin() calls in free_device_windows") and the plots[i].plot_window fix (PR #468, "fix: free plot_window in delete_all_windows").

src/nvtop.c (+5 lines, -1 line)

  • nvtop_probe_nvlink_list() and nvtop_set_nvlink_probe() called before curses init (first layout pass)
  • Re-evaluated in main loop after interface_check_monitored_gpu_change() for GPU hotplug

Design Decisions

Flat struct over nested

Single struct per device. Error and correction counters are cumulative totals (unsigned long long) summed across all links. Avoids per-link arrays and dynamic allocation in the hot refresh path.

Two-tier error state: baseline plus display

Five fields in struct gpu_info_nvidia track error state: baseline_errors, baseline_corrections, nvlink_errors_baseline_read (bool), display_errors, display_corrections. Baselines persist for the entire process lifetime. Display values computed each refresh as cumulative - baseline.
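A minimal sketch of the baseline/display scheme, assuming the five field names listed above; the update function itself is hypothetical, an illustrative reconstruction of the described logic:

```c
#include <stdbool.h>

/* The five error-state fields tracked per device. */
struct nvlink_err_state {
  unsigned long long baseline_errors, baseline_corrections;
  bool nvlink_errors_baseline_read;
  unsigned long long display_errors, display_corrections;
};

/* Called each refresh with cumulative counters read from NVML. */
static void nvlink_update_display_errors(struct nvlink_err_state *s,
                                         unsigned long long cum_errors,
                                         unsigned long long cum_corrections) {
  if (!s->nvlink_errors_baseline_read) {
    /* First refresh: capture the baseline so the display starts at zero. */
    s->baseline_errors = cum_errors;
    s->baseline_corrections = cum_corrections;
    s->nvlink_errors_baseline_read = true;
  }
  /* display = cumulative - baseline, with an underflow guard in case the
   * driver resets its cumulative counters. */
  s->display_errors =
      cum_errors >= s->baseline_errors ? cum_errors - s->baseline_errors : 0;
  s->display_corrections = cum_corrections >= s->baseline_corrections
                               ? cum_corrections - s->baseline_corrections
                               : 0;
}
```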

total_errors / total_corrections retained for API compatibility

Populated from display_errors/display_corrections in nvlink_refresh_cached_info(). Primary display path uses nvtop_get_nvlink_error_counts(), but both carry the same data.

No new dependencies

Uses only NVML symbols already present in the nvidia-ml driver library and the nvidia-smi binary already installed on the system.


What Was Not Changed

Process listing, memory/GPU charts, configuration options, keyboard shortcuts, menu behavior, and all non-NVLink display fields remain identical to upstream.


Testing

Dual RTX 3090 Founders Edition 24GB with 3-slot NVLink Bridge (RTXA6000NVLINK3S-KIT). Displays in UI as: NVIDIA GeForce RTX 3090. 4 physical links per GPU. enum nvmlNvlinkVersion_t returns 5, representing NVLink v3.1. When idle, NVLink shows ~1.2 MiB/s aggregate residual throughput, presumably from protocol keep-alives and link maintenance. Errors/corrections correctly display E:00000 C:00000 on every launch, incrementing only when new errors occur (no real errors were observed during testing, so the increment behavior is not fully confirmed).

danbedford and others added 27 commits April 29, 2026 11:35
- Add nvlink_info and nvlink_link_info structs to extract_gpuinfo_common.h
- Add NVML function pointers for NVLink (link count, state, throughput, errors, ECC)
- Add nvtop_get_nvlink_info() function in extract_gpuinfo_nvidia.c
- Track throughput counters for delta-based rate calculation
- Gracefully handle missing NVLink support (no hard failure on consumer GPUs)
- Add nvlink_info window to device_window struct
- Allocate NVLink window on line 2 of device info block
- Shift all subsequent rows down by 1 to accommodate
- Add NVLink rendering: per-link status (A/x), TX/RX throughput, error indicators
- Color coding: green=active, red=inactive or errors present
- Update device_header_rows from 3/4 to 4/5 in layout calculation
- Replace nvmlDeviceGetNvLinkLinkCount (doesn't exist) with link
  discovery via nvmlDeviceGetNvLinkState probe loop
- Replace nvmlDeviceGetNvLinkThroughput (doesn't exist) with
  nvmlDeviceGetNvLinkUtilizationCounter (returns both RX and TX)
- Remove nvmlDeviceGetNvLinkRemoteDeviceInfo (doesn't exist)
- Remove nvmlDeviceGetNvLinkEccCounter (doesn't exist, covered by
  nvmlDeviceGetNvLinkErrorCounter with type DL_ECC_DATA)
- Skip throughput display on consumer GPUs where utilization
  counters return NVML_ERROR_NOT_SUPPORTED
- Show all 4 links (L0A L1A L2A L3A) on RTX 3090 instead of N/A
- Flatten struct nvlink_info (no nested link_info array)
- Throughput via nvidia-smi CLI (poll every 2s) instead of NVML utilization counters
- Conditional layout: any_device_has_nvlink flag controls spacing
- Revert to exact upstream layout when no NVLink GPU detected
- Marketing version remapping with device name overrides for RTX 3090
- Reuse print_pcie_at_scale() for throughput formatting
- Only 2 dlsym'd NVML symbols: GetNvLinkState, GetNvLinkVersion
…e scale function

Add NVLink error and CRC correction counter display on line 4 of the
device panel, showing cumulative errors (replay, recovery, CRC FLIT,
CRC DATA) and per-lane CRC flit corrections with conditional coloring
(errors in red, corrections in yellow). Counters use baseline
subtraction so they start at zero on nvtop launch and only increment
when new errors/corrections occur.

Display format: NVL E:00000 C:00000 (19 chars), with "NVL" in cyan
and numeric values conditionally colored. Window is allocated only
for devices with NVLink support.

- src/extract_gpuinfo_nvidia.c: Add nvlink_read_errors() function
  using nvmlDeviceGetNvLinkErrorCounter and nvmlDeviceGetFieldValues
  with baseline tracking per-device.

- include/nvtop/extract_gpuinfo_common.h: Add total_errors and
  total_corrections fields to struct nvlink_info.

- src/interface.c: Add nvlink_errors window allocation, deallocation,
  and display logic in draw_devices().

- include/nvtop/interface_internal_common.h: Add nvlink_errors window
  pointer and device_nvlink_errors enum entry.

- Rename print_pcie_at_scale() to print_data_at_scale() and extend
  loop from 5 to 6 to support TiB/s (future-proofing for NVLink 5.0).

- Fix FAN field width (8 -> 11 chars) and reduce spacing to fit all
  fields within the device panel width.

- Fix device_length() to use max() across all three device lines
  instead of only lines 1 and 2.
…ayout, update display getter

Move nvlink_read_errors() out of nvtop_get_nvlink_info() and into
gpuinfo_nvidia_refresh_dynamic_info() so the baseline is not established
during the startup probe. This ensures E:00000 C:00000 on every nvtop
launch.

- Add display_errors/display_corrections fields to struct gpu_info_nvidia
- Add nvtop_get_nvlink_error_counts() public getter (extract_gpuinfo_common.h)
- Update interface.c to use the getter instead of nvtop_get_nvlink_info()
- Fix nvmlFieldValue_t struct offsets: 48 bytes (not 12), ullVal at offset 32
- Fix dlsym signature for nvmlDeviceGetFieldValues (remove fieldIds parameter)
- Populate fieldId in-place in the raw buffer before calling GetFieldValues
…hroughput r)

Use raw (payload + protocol overhead) counters instead of data-only for
the nvidia-smi CLI fallback path. This ensures fully saturated links
show the rated link speed (e.g. ~14.062 GB/s per link on NVLink 3.0)
rather than roughly half that from data-only counters.

- Changed --getthroughput d to --getthroughput r
- Updated parsing from 'Data Tx/Rx' to 'Raw Tx/Rx'
- Added explanatory code comments
…ment

Add a code comment guiding future developers with datacenter NVLink
hardware (A100, H100) to replace the CLI fallback with the NVML
nvmlDeviceGetNvLinkUtilizationCounter API, while keeping the CLI as a
consumer GPU fallback. Also remove the misleading 'EMA smoothing' comment
on the aggregate throughput output since no smoothing is actually applied.
…ccuracy

Remove Exponential Moving Average smoothing from the nvidia-smi CLI
throughput fallback path. Raw delta/time_delta is used directly without
smoothing — accuracy is more important than display smoothness for a
monitoring tool.
…nterval

- NVTOP_NVLINK_MAX_LINKS and NVML_NVLINK_MAX_LINKS_INTERNAL increased from 18 to 36
  for future-proof support of devices with up to 36 NVLink links.
- Add explicit comment: 2-second nvidia-smi CLI poll interval is hardcoded and
  independent of global refresh rate, minimizing resource usage for this resource-
  heavy process (full binary fork + text parsing).
- Update code comments with expanded field ID range (32-247 for links 0-35).
…plot_window, and unsigned underflow guard

- free_device_windows: delwin() for shader_cores, l2_cache_size,
  exec_engines (upstream PR Syllo#467 fix/memory-leaks-in-free-device-windows)
- delete_all_windows: delwin() for plots[i].plot_window
  (upstream PR Syllo#468 fix/plot-window-memory-leak)
- nvtop_get_nvlink_info: guard against unsigned underflow in CLI
  throughput delta if hardware counter wraps or resets
- print_data_at_scale: change parameter from unsigned int to unsigned long long
  to prevent 32-bit truncation on high-throughput NVLink hardware (e.g. B100/GB200)
- Remove duplicate #include <stdio.h> / #include <string.h> mid-file
  (already included at the top of extract_gpuinfo_nvidia.c)
- Remove redundant forward declaration of struct gpu_info_nvidia
  (struct already fully defined earlier in the same file)
- Remove unused NVM_LVALUE_VALUE_TYPE_OFF macro
…efresh cycle

- Add nvlink_cached_linkcount and nvlink_cached_version to
  struct gpu_info_nvidia (static hardware properties, probe once)
- Add nvlink_probe_and_cache() helper that probes all links and
  caches results on first call, returns cached value thereafter
- Replace inline probe loop in refresh_dynamic_info() with
  nvlink_probe_and_cache() call (also fixes hardcoded limit of 18
  -> now uses full NVML_NVLINK_MAX_LINKS_INTERNAL of 36)
- Replace inline probe loop in nvtop_get_nvlink_info() with
  nvlink_probe_and_cache() call, reads version from cache
- Eliminates up to 36 NVML API calls per GPU per refresh cycle
…t-change reset

- Add early return in nvtop_probe_nvlink_list() when
  any_device_has_nvlink is already true — NVLink support is a
  static hardware property, no need to re-probe every refresh cycle
- Reset any_device_has_nvlink in
  interface_check_monitored_gpu_change() when monitored set changes,
  so the user can switch between NVLink and non-NVLink GPUs without
  the cache becoming stale
Add case 8: return 6 to nvlink_marketing_version() to handle
NVLink 6.0 raw NVML enum value from NVIDIA Rubin platform.
Also adds descriptive comments to existing version mapping cases.
Probe NVLink version before link state loop so "supported but no
bridge" is detected. Display shows "NVL3 0x" for GPUs with NVLink
hardware but no bridge connected. Layout compaction only applies
when active links are present (0-link display needs no padding
reduction). Adds any_device_has_nvlink_active flag to distinguish
NVLink hardware support from active connections.
…ctive links

When NVLink is supported but no bridge connected, the NVLink info window
is still allocated on line 2 (displaying 'NVL3 0x'). The old code only
expanded the panel width when links were active, causing the NVLink
window to overflow the panel boundary in the 0-link case.

Now device_length() checks any_device_has_nvlink (not
any_device_has_nvlink_active) to include the NVLink window width in the
panel calculation. Fan field padding (11 chars) is preserved since no
throughput display is needed.
nvtop_probe_nvlink_list() correctly sets any_device_has_nvlink_active=false
when NVLink hardware is present but no links are active. But nvtop_set_nvlink_probe()
then blindly overwrites it with the return value (true), destroying the distinction.

This caused fan field to shrink to 8 chars even with 0 active links, making
line 3 bar charts (GPU/MEM/Enc/Dec) expand incorrectly.
Panel width (device_length) controls all rows including line 3 bar charts.
Expanding it for the 0-link case was making GPU/MEM/Enc/Dec bars too wide.

Now panel width only expands when any_device_has_nvlink_active (actual links
with throughput to display). For the 0-link "NVL3 0x" case, the NVLink window
extends past the nominal panel edge which is fine — ncurses handles overlapping
windows correctly and line 3 bars stay at proper width.
…any_device_has_nvlink

For NVLink-supported GPUs with 0 active links (no bridge), the fan field was
using compact format ("FAN %3u%%") instead of the upstream padded format
(" FAN %3u%%  "). Changed all three fan format conditionals from
any_device_has_nvlink to any_device_has_nvlink_active so the 0-link case
preserves the standard spacing and field width.
…n non-NVIDIA GPUs

nvtop_get_nvlink_info(), nvtop_get_nvlink_error_counts(), and
nvtop_reset_nvlink_cache() use container_of() to cast gpu_info to
gpu_info_nvidia. On a non-NVIDIA device this is undefined behavior.

Add a strcmp() guard at the top of each function to return early
for non-NVIDIA GPUs. This avoids the unsafe cast entirely and makes
the code correct for mixed-vendor or NVIDIA-free systems.
Line 3 bar charts (GPU/MEM/Enc/Dec) were 6 chars too wide with NVLink
bridge installed. device_length() expanded panel to 90 (line2 with
pcie field) instead of 84 (base layout). NVLink window on line 2 can
extend past nominal panel edge — ncurses handles it fine, same as the
0-link case. Reverting to base layout keeps line 3 bar charts at the
correct width.
Three related fixes:

1. fan N/A fallback (line 912) uses any_device_has_nvlink_active instead
   of any_device_has_nvlink for consistent layout compaction.

2. Reset any_device_has_nvlink_active in interface_check_monitored_gpu_change()
   alongside any_device_has_nvlink to prevent stale flags from causing
   incorrect nvlink_errors window allocation during window rebuild.

3. Reset fan field width to 11 in interface_check_monitored_gpu_change()
   so initialize_curses() allocates fan_speed windows at the correct
   default width after a monitored-set change.

---

## Testing

Tested on dual RTX 3090 Founders Edition 24GB with a 3-slot NVLink bridge (RTXA6000NVLINK3S-KIT), displayed in the UI as `NVIDIA GeForce RTX 3090`, with 4 physical links per GPU. `enum nvmlNvlinkVersion_t` returns `5`, representing NVLink v3.1. When idle, NVLink shows ~1.2 MiB/s aggregate residual throughput, presumably from protocol keep-alives/link maintenance. Errors/corrections display `E:00000 C:00000` on every launch and should increment only when new errors occur (not fully confirmed, since no errors were observed during testing).
Contributor Author

@danbedford danbedford left a comment

Thanks for cleaning up the comments, all good here. 😁

Comment thread src/extract_gpuinfo_nvidia.c Outdated
if (nvmlDeviceGetFieldValues) {
for (unsigned int link = 0; link < linkCount; link++) {
int base_field_id = 32 + link * 6;
char raw[6 * NVM_LVALUE_SIZE];
Owner

Please declare and use a struct nvmlFieldValue_t instead of doing these raw bytes copy shenanigans.

The fact that it may change in the future is ok (usually the changes are backward compatible if implemented correctly by NVIDIA and the nvml library has had a pretty good record on doing exactly this).

Field layout is platform/target dependent so it's not a good idea to hard code the offsets and use memcpy like this.

Contributor Author

Recreated the nvmlFieldValue_t typedef along with its dependencies (nvmlValue_t, nvmlValueType_t, nvmlReturn_t). All field access now uses proper struct member syntax (.fieldId, .value.ullVal, .nvmlReturn) instead of byte offsets.

See commit 47a6cf8.

unsigned long long total_corrections; // Cumulative-since-launch CRC corrections across all links
};

unsigned nvtop_get_nvlink_info(struct gpu_info *gpu_info, struct nvlink_info *nvlink_info);
Owner

You need to define these functions in another file (fallback ones returning unsupported/false/0) for when nvtop is not being compiled with NVIDIA support enabled. Currently they are only available in extract_gpuinfo_nvidia.c.

I would advise creating a new .c file and compiling it only when NVIDIA_SUPPORT is OFF

Contributor Author

Done. Created src/nvlink_nvidia_disabled.c with 4 no-op stub functions:

  • nvtop_get_nvlink_info() — returns 0
  • nvtop_get_nvlink_error_counts() — returns false
  • nvtop_probe_nvlink_list() — returns false
  • nvtop_reset_nvlink_cache() — no-op

The stub file is wired into src/CMakeLists.txt in the else(NVIDIA_SUPPORT) branch so it is only compiled when NVIDIA support is disabled. You originally mentioned 5 functions in your review, but the 5th (nvtop_set_nvlink_probe) was removed entirely per your suggestion in Comment 7, so there are now 4.

See commit 666ffed.

Comment thread src/extract_gpuinfo_nvidia.c Outdated
memset(raw, 0, sizeof(raw));
for (int i = 0; i < 6; i++) {
unsigned int fid = (unsigned int)(base_field_id + i);
memcpy(raw + i * NVM_LVALUE_SIZE + NVM_LVALUE_FIELD_ID_OFF, &fid, sizeof(fid));
Owner

Please use the enum values of the field instead of this hard coded 32 + link * 6

Also from the same link there is a special value for fetching the error count for all lanes:

#define [NVML_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL](https://docs.nvidia.com/deploy/nvml-api/group__nvmlFieldValueEnums.html#group__nvmlFieldValueEnums_1g03448b3fc6f250afe4e70782a1e6ea2c) 38
    NVLink flow control CRC Error Counter total for all Lanes.

There seems to also be this for ECC errors:

#define [NVML_FI_DEV_NVLINK_ECC_DATA_ERROR_COUNT_TOTAL](https://docs.nvidia.com/deploy/nvml-api/group__nvmlFieldValueEnums.html#group__nvmlFieldValueEnums_1ge51d113266f33da0ca06bd85cc7b6818) 160
    NVLink data ECC Error Counter total for all Links.

Contributor Author

@danbedford danbedford May 7, 2026

Done. The hardcoded 32 + link * 6 pattern has been removed. The batched nvmlDeviceGetFieldValues call now uses official enum constants with #ifndef guards so it works with both older and newer driver headers:

  • NVML_FI_DEV_NVLINK_THROUGHPUT_RAW_TX (140)
  • NVML_FI_DEV_NVLINK_THROUGHPUT_RAW_RX (141)
  • NVML_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL (38)
  • NVML_FI_DEV_NVLINK_ECC_DATA_ERROR_COUNT_TOTAL (160)

I also followed your suggestion to add field 160 for ECC data errors. All 4 fields are queried in a single batch[4] call. Field 38 and field 160 both use scopeId=0 since they are already per-device aggregates across all lanes/links.

See commits 666ffed and a226ed7.

Contributor Author


ERRsAndCORs

Error counters display in "extra GPU info bar" updated with ECC errors. The display now shows:

  • FL (FLIT errors, non-zero red)
  • EE (ECC data errors, non-zero red)
  • CR (CRC corrections, non-zero yellow)

Comment thread src/extract_gpuinfo_nvidia.c Outdated
// Returns number of links parsed (0 on failure)
static unsigned nvlink_cli_get_throughput(int device_index, unsigned int link_count,
unsigned long long *tx_out, unsigned long long *rx_out) {
char cmd[256];
Owner

I don't like this CLI call and parsing.

Query nvmlDeviceGetFieldValues with the following IDs:

#define NVML_FI_DEV_NVLINK_THROUGHPUT_DATA_RX 139

#define NVML_FI_DEV_NVLINK_THROUGHPUT_DATA_TX 138

    NVLink throughput counters field values

    Link ID needs to be specified in the scopeId field in [nvmlFieldValue_t](https://docs.nvidia.com/deploy/nvml-api/structnvmlFieldValue__t.html#structnvmlFieldValue__t). A scopeId of UINT_MAX returns aggregate value summed up across all links for the specified counter type in fieldId.

Also you can probably do a single call to nvmlDeviceGetFieldValues for the RX, TX, and error counts. The struct has an nvmlReturn field to check if the requested values are valid. If not we just report nothing as usual.

Contributor Author

Done. All popen/pclose CLI code has been removed. Throughput and error counters are now read via a single batched nvmlDeviceGetFieldValues call with 4 fields (RAW TX, RAW RX, CRC corrections via field 38, and ECC data errors via field 160). Fields returning NVML_ERROR_NOT_SUPPORTED (e.g. throughput on consumer GPUs) are silently skipped.

See commit 666ffed.

// Called from refresh_dynamic_info on every refresh cycle (refresh path).
// GPUs are non-hot-swappable, so all NVLink data is computed here and cached —
// nvtop_get_nvlink_info() in the draw path just returns the cached copy.
static void nvlink_refresh_cached_info(struct gpu_info_nvidia *gpu_info, unsigned int linkCount) {
Owner

Please remove this along with the CLI fallback (see previous comment in this file)

Contributor Author

Done. The entire nvlink_cli_get_throughput() function and all associated struct fields have been removed: device_index, cli_poll_active, per-link CLI counters (nvlink_cli_tx[], nvlink_cli_rx[]), aggregate CLI values (cli_agg_tx, cli_agg_rx), and last_nvlink_cli_time.

See commit 666ffed.

Comment thread src/interface.c Outdated
Comment on lines +82 to +83
struct nvlink_info nvl;
memset(&nvl, 0, sizeof(nvl));
Owner

Suggested change:

    - struct nvlink_info nvl;
    - memset(&nvl, 0, sizeof(nvl));
    + struct nvlink_info nvl = {0};

Contributor Author

Done. All struct nvlink_info declarations now use = {0} initializer:

  • nvtop_probe_nvlink_list() in interface.c (line 82)
  • draw_devices() in interface.c (line 950)

The old pattern of separate declaration followed by memset has been removed. The same {0} initializer is also applied to nvmlFieldValue_t declarations in extract_gpuinfo_nvidia.c (the batch[4] array and any single-field queries).

See commits b97dac8 and 666ffed.

Comment thread src/interface.c
}
}

bool nvtop_probe_nvlink_list(struct list_head *devices) {
Owner

This function calls nvtop_adjust_field_sizes_for_nvlink, which is used inside initialize_all_windows.

I think it would be more fitting to call this inside initialize_all_windows instead of the spread-out fixes required by calling it in nvtop.c.

That way nvtop_set_nvlink_probe is not needed anymore.

Contributor Author

Done. nvtop_adjust_field_sizes_for_nvlink() is now called at the top of initialize_all_windows() in interface.c. nvtop_set_nvlink_probe() has been removed entirely — both call sites in nvtop.c now invoke nvtop_probe_nvlink_list() directly.

See commit 666ffed.

Replace CLI-based NVLink throughput with NVML API and refactor
probe/layout initialization per maintainer feedback.

Comment 1: include nvml.h, remove local typedefs, add #ifndef
guards for enum constants, update dlsym function pointer to use
proper nvmlFieldValue_t type, remove raw memcpy offset macros.

Comment 2: wire nvlink_nvidia_disabled.c stub file into
CMakeLists.txt else-branch for non-NVIDIA builds.

Comment 3: remove per-lane CRC corrections loop from
nvlink_read_errors() (Phase 2) - corrections now read in batched
call in nvlink_refresh_cached_info().

Comment 4: replace nvidia-smi CLI fallback with single batched
nvmlDeviceGetFieldValues call for RAW TX (140), RAW RX (141), and
CRC corrections (38). Use scopeId=UINT_MAX for throughput
aggregate, scopeId=0 for per-device corrections.

Comment 5: remove nvlink_cli_get_throughput() function and CLI
struct fields (device_index, cli_poll_active, nvlink_cli_tx/rx,
last_nvlink_cli_time, cli_agg_tx/rx). Replace with nvlink_last_tx,
nvlink_last_rx, nvlink_last_poll_time.

Comment 6: use struct nvlink_info nvl = {0} initializer in
nvtop_probe_nvlink_list().

Comment 7: move nvtop_adjust_field_sizes_for_nvlink() into
initialize_all_windows(), remove nvtop_set_nvlink_probe() entirely,
swap probe and interface_check_monitored_gpu_change() call order in
nvtop.c main loop, add re-probe in monitored-set-change handler.
danbedford added 4 commits May 6, 2026 20:57
nvml.h cannot be included directly — nvtop uses dlsym function pointers
for all NVML functions, and including nvml.h would conflict with 373
function prototypes and 12 struct/enum typedefs.

Instead, manually declare nvmlFieldValue_t and its dependencies
(nvmlValue_t, nvmlValueType_t, nvmlReturn_t, nvmlDevice_t) inline.
This satisfies the maintainer requirement to use the proper struct type
instead of raw memcpy offsets, without breaking the dlsym architecture.

Also removes the unused find_path(NVML_INCLUDE_DIR) from CMakeLists.txt
and fixes the forward declaration of nvlink_read_errors to use
nvmlDevice_t instead of struct nvmlDevice*.
Per maintainer suggestion in PR Syllo#469 Comment 3, add field 160
(NVML_FI_DEV_NVLINK_ECC_DATA_ERROR_COUNT_TOTAL) to the existing
batched nvmlDeviceGetFieldValues call alongside throughput and CRC
corrections.

- Batch expanded from 3 to 4 fields (scopeId=0, per-device aggregate)
- Added total_ecc_errors to struct nvlink_info
- Added baseline_ecc_errors/display_ecc_errors to struct gpu_info_nvidia
- Display format: NVL E:00000 C:00000 X:00000 (window width 19->28)
- Updated nvtop_get_nvlink_error_counts() to return ECC count
- Updated stub file and cache reset accordingly
- FL: FLIT errors (red if >0)
- EE: ECC data errors (red if >0)
- CR: CRC corrections (yellow if >0)
- Errors grouped together (FL/EE), corrections last (CR)
- Window width expanded from 28 to 31 chars