fix: restore gpu info gpus OTLP attribute by jingxiang-z · Pull Request #201 · NVIDIA/fleet-intelligence-agent

jingxiang-z · 2026-05-21T21:12:51Z

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Summary by CodeRabbit

New Features
- Added optional machine information collection and caching for health data exports
- GPU details now available as resource attributes in exported metrics when enabled
- Configurable via optional parameters during initialization

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

coderabbitai · 2026-05-21T21:13:06Z

📝 Walkthrough

Walkthrough

This PR adds optional cached machine-info collection to the health exporter. The collector gains an options system and conditional NVML-backed machine-info provider. Runtime collection reads from the cache and asynchronously refreshes. OTLP export serializes GPU info to resource attributes. NVML instance is injected through exporter options and threaded via server initialization.

Changes

Cached Machine-Info Collection

Layer / File(s)	Summary
Collector machine-info infrastructure and initialization `internal/exporter/collector/collector.go`	Collector struct extends to hold exporter config and machineInfoProvider. New options system introduced (Option type, WithNVMLInstance) allows NVML instance injection. New() conditionally creates and starts a cached machine-info provider when IncludeMachineInfo is enabled. Imports and constants establish NVML query support.
Machine-info collection at runtime `internal/exporter/collector/collector.go`	Collect() conditionally calls collectMachineInfo() when IncludeMachineInfo is enabled. collectMachineInfo() reads from the cached provider with best-effort waiting for initial readiness, assigns data.MachineInfo when available, and triggers asynchronous refresh for subsequent exports.
Machine-info collection test and test double `internal/exporter/collector/collector_test.go`	Unit test validates collectMachineInfo uses injected machineInfoProvider cache, verifying pointer identity of cached MachineInfo and RefreshAsync/WaitForInitialRefresh call behaviors. fakeMachineInfoProvider test double implements the provider interface with preset MachineInfo and boolean tracking of method calls.

OTLP GPU Info Export

Layer / File(s)	Summary
GPU info OTLP resource attribute serialization `internal/exporter/converter/otlp.go`	createOTLPResource conditionally appends gpuInfo.gpus OTLP resource attribute by JSON-marshaling data.MachineInfo.GPUInfo.GPUs when present and non-empty. Attribute omitted on empty GPU list or JSON marshaling failure.
GPU info OTLP serialization validation `internal/exporter/converter/otlp_test.go`	Test constructs HealthData with single GPU in MachineInfo.GPUInfo and validates gpuInfo.gpus attribute serializes to expected JSON array with only per-GPU fields. Verifies GPU-level metadata keys (product, manufacturer, architecture, memory) are excluded from resource attribute keys.

NVML Instance Dependency Injection

Layer / File(s)	Summary
Exporter options extension with NVML instance `internal/exporter/options.go`	Exporter options adds NVML import. exporterOptions struct extended with nvmlInstance field. WithNVMLInstance() exported option function allows callers to inject NVML instance for machine-info collection.
Exporter and server wiring of NVML instance `internal/exporter/exporter.go`, `internal/server/server.go`	Exporter.New() passes options.nvmlInstance into collector.New() via collector.WithNVMLInstance(). Server.New() includes exporter.WithNVMLInstance(nvmlInstance) in exporter construction, threading NVML instance from initialization through dependency chain.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit weaves NVML's thread,
Through layers of cache and export bed,
GPU info flows in JSON streams,
Conditional collection, async dreams,
Options injected, neat and true! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix: restore gpu info gpus OTLP attribute' clearly and specifically describes the main change: restoring GPU info GPUs to OTLP attributes, which aligns with the primary functionality added across the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/restore-gpuinfo-gpus-otlp

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

internal/exporter/collector/collector_test.go (1)
628-646: 💤 Low value

Consider adding a test for the cache-not-ready branch.

The current test covers the happy path (ok: true). For completeness, consider adding a test where ok starts as false to verify WaitForInitialRefresh is invoked, then transitions to true after waiting. This would validate the full cache-warming flow.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/exporter/collector/collector_test.go` around lines 628 - 646, Add a
test that exercises the cache-not-ready branch by instantiating
fakeMachineInfoProvider with ok=false, passing it into the collector setup used
in the existing happy-path test, and verifying that collector calls RefreshAsync
and WaitForInitialRefresh on the provider; use a goroutine or timer to flip f.ok
to true after a short delay so the collector proceeds and ensure f.waited and
f.refreshed become true and the final collection succeeds. Reference the
fakeMachineInfoProvider type and its methods Get, RefreshAsync, and
WaitForInitialRefresh when locating where to add the test and assertions.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/exporter/collector/collector_test.go`:
- Around line 628-646: Add a test that exercises the cache-not-ready branch by
instantiating fakeMachineInfoProvider with ok=false, passing it into the
collector setup used in the existing happy-path test, and verifying that
collector calls RefreshAsync and WaitForInitialRefresh on the provider; use a
goroutine or timer to flip f.ok to true after a short delay so the collector
proceeds and ensure f.waited and f.refreshed become true and the final
collection succeeds. Reference the fakeMachineInfoProvider type and its methods
Get, RefreshAsync, and WaitForInitialRefresh when locating where to add the test
and assertions.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d3a33a40-cdc6-4855-b505-6705735d76f8

📥 Commits

Reviewing files that changed from the base of the PR and between 9dda598 and 9a77797.

📒 Files selected for processing (7)

internal/exporter/collector/collector.go
internal/exporter/collector/collector_test.go
internal/exporter/converter/otlp.go
internal/exporter/converter/otlp_test.go
internal/exporter/exporter.go
internal/exporter/options.go
internal/server/server.go

mukilsh

LGTM

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

fix: restore gpu info gpus OTLP attribute

9a77797

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

jingxiang-z requested a review from mukilsh May 21, 2026 21:18

jingxiang-z self-assigned this May 21, 2026

mukilsh approved these changes May 21, 2026

View reviewed changes

jingxiang-z merged commit 1e96294 into main May 21, 2026
9 checks passed

jingxiang-z deleted the fix/restore-gpuinfo-gpus-otlp branch May 21, 2026 22:23

jingxiang-z added a commit that referenced this pull request May 21, 2026

fix: restore gpu info gpus OTLP attribute (#201) (#202)

c3b083d

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: restore gpu info gpus OTLP attribute#201

fix: restore gpu info gpus OTLP attribute#201
jingxiang-z merged 1 commit into
mainfrom
fix/restore-gpuinfo-gpus-otlp

jingxiang-z commented May 21, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 21, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

mukilsh left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jingxiang-z commented May 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

mukilsh left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jingxiang-z commented May 21, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 21, 2026 •

edited

Loading