Skip to content

fix: restore gpu info gpus OTLP attribute#201

Merged
jingxiang-z merged 1 commit into
mainfrom
fix/restore-gpuinfo-gpus-otlp
May 21, 2026
Merged

fix: restore gpu info gpus OTLP attribute#201
jingxiang-z merged 1 commit into
mainfrom
fix/restore-gpuinfo-gpus-otlp

Conversation

@jingxiang-z
Copy link
Copy Markdown
Collaborator

@jingxiang-z jingxiang-z commented May 21, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features
    • Added optional machine information collection and caching for health data exports
    • GPU details now available as resource attributes in exported metrics when enabled
    • Configurable via optional parameters during initialization

Review Change Stack

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

📝 Walkthrough

Walkthrough

This PR adds optional cached machine-info collection to the health exporter. The collector gains an options system and conditional NVML-backed machine-info provider. Runtime collection reads from the cache and asynchronously refreshes. OTLP export serializes GPU info to resource attributes. NVML instance is injected through exporter options and threaded via server initialization.

Changes

Cached Machine-Info Collection

Layer / File(s) Summary
Collector machine-info infrastructure and initialization
internal/exporter/collector/collector.go
Collector struct extends to hold exporter config and machineInfoProvider. New options system introduced (Option type, WithNVMLInstance) allows NVML instance injection. New() conditionally creates and starts a cached machine-info provider when IncludeMachineInfo is enabled. Imports and constants establish NVML query support.
Machine-info collection at runtime
internal/exporter/collector/collector.go
Collect() conditionally calls collectMachineInfo() when IncludeMachineInfo is enabled. collectMachineInfo() reads from the cached provider with best-effort waiting for initial readiness, assigns data.MachineInfo when available, and triggers asynchronous refresh for subsequent exports.
Machine-info collection test and test double
internal/exporter/collector/collector_test.go
Unit test validates collectMachineInfo uses injected machineInfoProvider cache, verifying pointer identity of cached MachineInfo and RefreshAsync/WaitForInitialRefresh call behaviors. fakeMachineInfoProvider test double implements the provider interface with preset MachineInfo and boolean tracking of method calls.

OTLP GPU Info Export

Layer / File(s) Summary
GPU info OTLP resource attribute serialization
internal/exporter/converter/otlp.go
createOTLPResource conditionally appends gpuInfo.gpus OTLP resource attribute by JSON-marshaling data.MachineInfo.GPUInfo.GPUs when present and non-empty. Attribute omitted on empty GPU list or JSON marshaling failure.
GPU info OTLP serialization validation
internal/exporter/converter/otlp_test.go
Test constructs HealthData with single GPU in MachineInfo.GPUInfo and validates gpuInfo.gpus attribute serializes to expected JSON array with only per-GPU fields. Verifies GPU-level metadata keys (product, manufacturer, architecture, memory) are excluded from resource attribute keys.

NVML Instance Dependency Injection

Layer / File(s) Summary
Exporter options extension with NVML instance
internal/exporter/options.go
Exporter options adds NVML import. exporterOptions struct extended with nvmlInstance field. WithNVMLInstance() exported option function allows callers to inject NVML instance for machine-info collection.
Exporter and server wiring of NVML instance
internal/exporter/exporter.go, internal/server/server.go
Exporter.New() passes options.nvmlInstance into collector.New() via collector.WithNVMLInstance(). Server.New() includes exporter.WithNVMLInstance(nvmlInstance) in exporter construction, threading NVML instance from initialization through dependency chain.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit weaves NVML's thread,
Through layers of cache and export bed,
GPU info flows in JSON streams,
Conditional collection, async dreams,
Options injected, neat and true! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: restore gpu info gpus OTLP attribute' clearly and specifically describes the main change: restoring GPU info GPUs to OTLP attributes, which aligns with the primary functionality added across the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/restore-gpuinfo-gpus-otlp

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/exporter/collector/collector_test.go (1)

628-646: 💤 Low value

Consider adding a test for the cache-not-ready branch.

The current test covers the happy path (ok: true). For completeness, consider adding a test where ok starts as false to verify WaitForInitialRefresh is invoked, then transitions to true after waiting. This would validate the full cache-warming flow.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/exporter/collector/collector_test.go` around lines 628 - 646, Add a
test that exercises the cache-not-ready branch by instantiating
fakeMachineInfoProvider with ok=false, passing it into the collector setup used
in the existing happy-path test, and verifying that collector calls RefreshAsync
and WaitForInitialRefresh on the provider; use a goroutine or timer to flip f.ok
to true after a short delay so the collector proceeds and ensure f.waited and
f.refreshed become true and the final collection succeeds. Reference the
fakeMachineInfoProvider type and its methods Get, RefreshAsync, and
WaitForInitialRefresh when locating where to add the test and assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/exporter/collector/collector_test.go`:
- Around line 628-646: Add a test that exercises the cache-not-ready branch by
instantiating fakeMachineInfoProvider with ok=false, passing it into the
collector setup used in the existing happy-path test, and verifying that
collector calls RefreshAsync and WaitForInitialRefresh on the provider; use a
goroutine or timer to flip f.ok to true after a short delay so the collector
proceeds and ensure f.waited and f.refreshed become true and the final
collection succeeds. Reference the fakeMachineInfoProvider type and its methods
Get, RefreshAsync, and WaitForInitialRefresh when locating where to add the test
and assertions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d3a33a40-cdc6-4855-b505-6705735d76f8

📥 Commits

Reviewing files that changed from the base of the PR and between 9dda598 and 9a77797.

📒 Files selected for processing (7)
  • internal/exporter/collector/collector.go
  • internal/exporter/collector/collector_test.go
  • internal/exporter/converter/otlp.go
  • internal/exporter/converter/otlp_test.go
  • internal/exporter/exporter.go
  • internal/exporter/options.go
  • internal/server/server.go

@jingxiang-z jingxiang-z requested a review from mukilsh May 21, 2026 21:18
@jingxiang-z jingxiang-z self-assigned this May 21, 2026
Copy link
Copy Markdown
Contributor

@mukilsh mukilsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jingxiang-z jingxiang-z merged commit 1e96294 into main May 21, 2026
9 checks passed
@jingxiang-z jingxiang-z deleted the fix/restore-gpuinfo-gpus-otlp branch May 21, 2026 22:23
jingxiang-z added a commit that referenced this pull request May 21, 2026
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants