Skip to content

feat: create nvml component#194

Merged
ambermingxin merged 2 commits into
mainfrom
feat/report-nvml-error
May 14, 2026
Merged

feat: create nvml component#194
ambermingxin merged 2 commits into
mainfrom
feat/report-nvml-error

Conversation

@ambermingxin
Copy link
Copy Markdown
Collaborator

@ambermingxin ambermingxin commented May 13, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Added NVIDIA NVML GPU health monitoring component to the platform.
    • Improved GPU attribute collection with best-effort approach to handle partial failures gracefully.
  • Tests

    • Added test coverage for NVML component health checks and partial NVML failure scenarios.

Review Change Stack

Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

📝 Walkthrough

Walkthrough

This pull request adds GPU health monitoring via a new NVML component and refactors GPU attribute collection to report errors without failing. The component receives accumulated GPU errors and transitions health state accordingly. Tests validate both component behavior and successful GPU collection despite per-attribute failures.

Changes

NVML Component and Best-Effort GPU Info Collection

Layer / File(s) Summary
NVML component implementation
third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go
Introduces accelerator-nvidia-nvml component implementing components.Component interface with health state tracking. Stores NVML instance and cached results, registers globally to receive error updates via RunCheckWithErrors, marks health unhealthy when collection errors occur, and extracts entity IDs from error messages to build HealthStateIncident entries. Implements checkResult with health state reporting including JSON-encoded error metadata.
NVML component tests
third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component_test.go
Tests error-to-health-state transitions with mocked NVML instance. Verifies component transitions to unhealthy with incident when RunCheckWithErrors receives GPU errors, and back to healthy when called with nil.
Best-effort GPU attribute collection
third_party/fleet-intelligence-sdk/pkg/machine-info/machine_info.go
Refactors GetMachineGPUInfo from fail-fast to best-effort error handling. Accumulates per-GPU collection failures (memory, serial, minor ID, board ID, PCI bus ID, VBIOS version, GPU index) using collectionErrors accumulator. Invokes componentnvml.RunCheckWithErrors(collectionErrors) after GPU iteration to process errors.
Machine info GPU collection tests
third_party/fleet-intelligence-sdk/pkg/machine-info/machine_info_test.go
Updates import dependencies for NVML device types. Adjusts ChassisSN assertions to be conditional on platform info support. Adds TestGetMachineGPUInfo_PartialNVMLFailureIsNonFatal with partial-failure mock NVML types that fail on platform info, memory, serial, and PCI bus ID calls, verifying GetMachineGPUInfo completes without error and returns GPU UUID despite per-attribute failures.
Registry component wiring
internal/registry/registry.go
Imports NVML component package and adds enabled-by-default entry to All() component registry.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A new component hops into the fold,
Tracking GPU health, brave and bold.
Errors collected, but GPUs stay strong,
Best-effort flows where failures belong!
From registry to health, the path is long. 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: create nvml component' accurately summarizes the main change: introducing a new NVML-backed accelerator component across the registry and supporting modules.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/report-nvml-error

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8cd16af350

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go (1)

254-265: ⚡ Quick win

Consider adding direct test coverage for entity ID extraction.

The entity ID parsing logic depends on the exact message format from machine_info.go line 339 ("gpu %s: %s failed: %s"). While the current implementation is correct and tested indirectly through TestRunCheckWithErrors, adding a direct unit test for extractEntityID with various message formats would make the parsing contract more explicit and catch format mismatches earlier.

📝 Example test to add
func Test_extractEntityID(t *testing.T) {
	tests := []struct {
		message string
		want    string
	}{
		{"gpu GPU-123: get_memory failed: out of memory", "GPU-123"},
		{"gpu GPU-abc-456: get_serial failed: not supported", "GPU-abc-456"},
		{"invalid format", ""},
		{"gpu ", ""},
		{"gpu :", ""},
	}
	for _, tt := range tests {
		got := extractEntityID(tt.message)
		assert.Equal(t, tt.want, got, "message: %s", tt.message)
	}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go`
around lines 254 - 265, Add a direct unit test for the extractEntityID function
to explicitly validate parsing of different message formats; create
Test_extractEntityID that calls extractEntityID with messages like "gpu GPU-123:
get_memory failed: out of memory", "gpu GPU-abc-456: get_serial failed: not
supported", and invalid cases ("invalid format", "gpu ", "gpu :") and assert the
expected outputs (e.g. "GPU-123", "GPU-abc-456", and "" for invalid cases) so
the parsing contract in extractEntityID is covered independently of
TestRunCheckWithErrors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go`:
- Around line 254-265: Add a direct unit test for the extractEntityID function
to explicitly validate parsing of different message formats; create
Test_extractEntityID that calls extractEntityID with messages like "gpu GPU-123:
get_memory failed: out of memory", "gpu GPU-abc-456: get_serial failed: not
supported", and invalid cases ("invalid format", "gpu ", "gpu :") and assert the
expected outputs (e.g. "GPU-123", "GPU-abc-456", and "" for invalid cases) so
the parsing contract in extractEntityID is covered independently of
TestRunCheckWithErrors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 920783a9-bf79-4acd-8c9d-1a3b63e2c482

📥 Commits

Reviewing files that changed from the base of the PR and between 0a93e84 and 8cd16af.

📒 Files selected for processing (5)
  • internal/registry/registry.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component_test.go
  • third_party/fleet-intelligence-sdk/pkg/machine-info/machine_info.go
  • third_party/fleet-intelligence-sdk/pkg/machine-info/machine_info_test.go

Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
@ambermingxin ambermingxin requested a review from mukilsh May 14, 2026 16:36
Copy link
Copy Markdown
Contributor

@mukilsh mukilsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just left one comment!

Copy link
Copy Markdown
Contributor

@mukilsh mukilsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ambermingxin ambermingxin merged commit 8c48f2d into main May 14, 2026
8 checks passed
@ambermingxin ambermingxin deleted the feat/report-nvml-error branch May 14, 2026 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants