feat: create nvml component by ambermingxin · Pull Request #194 · NVIDIA/fleet-intelligence-agent

ambermingxin · 2026-05-13T21:19:49Z

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Summary by CodeRabbit

New Features
- Added NVIDIA NVML GPU health monitoring component to the platform.
- Improved GPU attribute collection with best-effort approach to handle partial failures gracefully.
Tests
- Added test coverage for NVML component health checks and partial NVML failure scenarios.

Signed-off-by: Amber Xue <ambermingxin@nvidia.com>

coderabbitai · 2026-05-13T21:20:00Z

📝 Walkthrough

Walkthrough

This pull request adds GPU health monitoring via a new NVML component and refactors GPU attribute collection to report errors without failing. The component receives accumulated GPU errors and transitions health state accordingly. Tests validate both component behavior and successful GPU collection despite per-attribute failures.

Changes

NVML Component and Best-Effort GPU Info Collection

Layer / File(s)	Summary
NVML component implementation `third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go`	Introduces `accelerator-nvidia-nvml` component implementing `components.Component` interface with health state tracking. Stores NVML instance and cached results, registers globally to receive error updates via `RunCheckWithErrors`, marks health unhealthy when collection errors occur, and extracts entity IDs from error messages to build `HealthStateIncident` entries. Implements `checkResult` with health state reporting including JSON-encoded error metadata.
NVML component tests `third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component_test.go`	Tests error-to-health-state transitions with mocked NVML instance. Verifies component transitions to unhealthy with incident when `RunCheckWithErrors` receives GPU errors, and back to healthy when called with nil.
Best-effort GPU attribute collection `third_party/fleet-intelligence-sdk/pkg/machine-info/machine_info.go`	Refactors `GetMachineGPUInfo` from fail-fast to best-effort error handling. Accumulates per-GPU collection failures (memory, serial, minor ID, board ID, PCI bus ID, VBIOS version, GPU index) using `collectionErrors` accumulator. Invokes `componentnvml.RunCheckWithErrors(collectionErrors)` after GPU iteration to process errors.
Machine info GPU collection tests `third_party/fleet-intelligence-sdk/pkg/machine-info/machine_info_test.go`	Updates import dependencies for NVML device types. Adjusts `ChassisSN` assertions to be conditional on platform info support. Adds `TestGetMachineGPUInfo_PartialNVMLFailureIsNonFatal` with partial-failure mock NVML types that fail on platform info, memory, serial, and PCI bus ID calls, verifying `GetMachineGPUInfo` completes without error and returns GPU UUID despite per-attribute failures.
Registry component wiring `internal/registry/registry.go`	Imports NVML component package and adds enabled-by-default entry to `All()` component registry.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A new component hops into the fold,
Tracking GPU health, brave and bold.
Errors collected, but GPUs stay strong,
Best-effort flows where failures belong!
From registry to health, the path is long. 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: create nvml component' accurately summarizes the main change: introducing a new NVML-backed accelerator component across the registry and supporting modules.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/report-nvml-error

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8cd16af350

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

coderabbitai

🧹 Nitpick comments (1)

third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go (1)
254-265: ⚡ Quick win

Consider adding direct test coverage for entity ID extraction.

The entity ID parsing logic depends on the exact message format from machine_info.go line 339 ("gpu %s: %s failed: %s"). While the current implementation is correct and tested indirectly through TestRunCheckWithErrors, adding a direct unit test for extractEntityID with various message formats would make the parsing contract more explicit and catch format mismatches earlier.
📝 Example test to add
func Test_extractEntityID(t *testing.T) {
	tests := []struct {
		message string
		want    string
	}{
		{"gpu GPU-123: get_memory failed: out of memory", "GPU-123"},
		{"gpu GPU-abc-456: get_serial failed: not supported", "GPU-abc-456"},
		{"invalid format", ""},
		{"gpu ", ""},
		{"gpu :", ""},
	}
	for _, tt := range tests {
		got := extractEntityID(tt.message)
		assert.Equal(t, tt.want, got, "message: %s", tt.message)
	}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go`
around lines 254 - 265, Add a direct unit test for the extractEntityID function
to explicitly validate parsing of different message formats; create
Test_extractEntityID that calls extractEntityID with messages like "gpu GPU-123:
get_memory failed: out of memory", "gpu GPU-abc-456: get_serial failed: not
supported", and invalid cases ("invalid format", "gpu ", "gpu :") and assert the
expected outputs (e.g. "GPU-123", "GPU-abc-456", and "" for invalid cases) so
the parsing contract in extractEntityID is covered independently of
TestRunCheckWithErrors.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go`:
- Around line 254-265: Add a direct unit test for the extractEntityID function
to explicitly validate parsing of different message formats; create
Test_extractEntityID that calls extractEntityID with messages like "gpu GPU-123:
get_memory failed: out of memory", "gpu GPU-abc-456: get_serial failed: not
supported", and invalid cases ("invalid format", "gpu ", "gpu :") and assert the
expected outputs (e.g. "GPU-123", "GPU-abc-456", and "" for invalid cases) so
the parsing contract in extractEntityID is covered independently of
TestRunCheckWithErrors.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 920783a9-bf79-4acd-8c9d-1a3b63e2c482

📥 Commits

Reviewing files that changed from the base of the PR and between 0a93e84 and 8cd16af.

📒 Files selected for processing (5)

internal/registry/registry.go
third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go
third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component_test.go
third_party/fleet-intelligence-sdk/pkg/machine-info/machine_info.go
third_party/fleet-intelligence-sdk/pkg/machine-info/machine_info_test.go

Signed-off-by: Amber Xue <ambermingxin@nvidia.com>

mukilsh

Just left one comment!

mukilsh

LGTM

creating nvml component

8cd16af

Signed-off-by: Amber Xue <ambermingxin@nvidia.com>

chatgpt-codex-connector Bot reviewed May 13, 2026

View reviewed changes

Comment thread third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

fix lint

356eb8b

Signed-off-by: Amber Xue <ambermingxin@nvidia.com>

ambermingxin requested a review from mukilsh May 14, 2026 16:36

mukilsh reviewed May 14, 2026

View reviewed changes

Comment thread third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go

mukilsh approved these changes May 14, 2026

View reviewed changes

Comment thread third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go

ambermingxin merged commit 8c48f2d into main May 14, 2026
8 checks passed

ambermingxin deleted the feat/report-nvml-error branch May 14, 2026 17:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: create nvml component#194

feat: create nvml component#194
ambermingxin merged 2 commits into
mainfrom
feat/report-nvml-error

ambermingxin commented May 13, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 13, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

mukilsh left a comment

Uh oh!

Uh oh!

mukilsh left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ambermingxin commented May 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

mukilsh left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mukilsh left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ambermingxin commented May 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 13, 2026 •

edited

Loading