feat: create nvml component #194
Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
📝 Walkthrough
This pull request adds GPU health monitoring via a new NVML component and refactors GPU attribute collection to report errors without failing. The component receives accumulated GPU errors and transitions health state accordingly. Tests validate both component behavior and successful GPU collection despite per-attribute failures.
Changes
NVML Component and Best-Effort GPU Info Collection
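To make the best-effort collection described in the walkthrough concrete, here is a minimal sketch in Go. It is not the code from this PR: the names `gpuInfo`, `attrGetter`, `collectBestEffort`, `healthFromErrors`, and `exampleGetters`, as well as the `"Healthy"`/`"Unhealthy"` state strings, are hypothetical and chosen for illustration. Only the error message format `"gpu %s: %s failed: %s"` is taken from the review comment further down.

```go
package main

import (
	"errors"
	"fmt"
)

// gpuInfo holds whichever attributes could be read for one GPU (hypothetical type).
type gpuInfo struct {
	UUID  string
	Attrs map[string]string
}

// attrGetter is a hypothetical per-attribute reader standing in for an NVML call.
type attrGetter struct {
	Name string
	Get  func(uuid string) (string, error)
}

// collectBestEffort reads every attribute for every GPU. A failing attribute
// is recorded as an error message and skipped; it never aborts the collection.
func collectBestEffort(uuids []string, getters []attrGetter) ([]gpuInfo, []string) {
	infos := make([]gpuInfo, 0, len(uuids))
	var errs []string
	for _, uuid := range uuids {
		info := gpuInfo{UUID: uuid, Attrs: map[string]string{}}
		for _, g := range getters {
			v, err := g.Get(uuid)
			if err != nil {
				// Same message shape as quoted in the review: "gpu %s: %s failed: %s".
				errs = append(errs, fmt.Sprintf("gpu %s: %s failed: %s", uuid, g.Name, err))
				continue
			}
			info.Attrs[g.Name] = v
		}
		infos = append(infos, info)
	}
	return infos, errs
}

// healthFromErrors maps accumulated errors to a coarse health state.
// The state names are assumptions, not the component's real values.
func healthFromErrors(errs []string) string {
	if len(errs) == 0 {
		return "Healthy"
	}
	return "Unhealthy"
}

// exampleGetters returns two illustrative readers: one succeeds, one fails.
func exampleGetters() []attrGetter {
	return []attrGetter{
		{Name: "get_memory", Get: func(string) (string, error) { return "81920MiB", nil }},
		{Name: "get_serial", Get: func(string) (string, error) { return "", errors.New("not supported") }},
	}
}
```

Calling `collectBestEffort([]string{"GPU-123"}, exampleGetters())` in this sketch still returns one `gpuInfo` with the memory attribute filled in, alongside a single error message for the failed serial lookup. A consumer such as the health component can then decide how to react to the accumulated messages.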
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8cd16af350
🧹 Nitpick comments (1)
third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go (1)
254-265: ⚡ Quick win
Consider adding direct test coverage for entity ID extraction.
The entity ID parsing logic depends on the exact message format from `machine_info.go` line 339 (`"gpu %s: %s failed: %s"`). While the current implementation is correct and tested indirectly through `TestRunCheckWithErrors`, adding a direct unit test for `extractEntityID` with various message formats would make the parsing contract more explicit and catch format mismatches earlier.
📝 Example test to add
```go
// Needs the "testing" and "github.com/stretchr/testify/assert" imports.
func Test_extractEntityID(t *testing.T) {
	// Each case pairs an error message with the entity ID expected from it.
	tests := []struct {
		message string
		want    string
	}{
		{"gpu GPU-123: get_memory failed: out of memory", "GPU-123"},
		{"gpu GPU-abc-456: get_serial failed: not supported", "GPU-abc-456"},
		{"invalid format", ""},
		{"gpu ", ""},
		{"gpu :", ""},
	}
	for _, tt := range tests {
		got := extractEntityID(tt.message)
		assert.Equal(t, tt.want, got, "message: %s", tt.message)
	}
}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go` around lines 254 - 265, Add a direct unit test for the extractEntityID function to explicitly validate parsing of different message formats; create Test_extractEntityID that calls extractEntityID with messages like "gpu GPU-123: get_memory failed: out of memory", "gpu GPU-abc-456: get_serial failed: not supported", and invalid cases ("invalid format", "gpu ", "gpu :") and assert the expected outputs (e.g. "GPU-123", "GPU-abc-456", and "" for invalid cases) so the parsing contract in extractEntityID is covered independently of TestRunCheckWithErrors.
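For readers without the PR diff at hand, the parsing contract this test targets can be sketched as follows. This is an assumption about what such a parser might look like, not the `extractEntityID` from `component.go`; the function is deliberately named `extractEntityIDSketch` to keep the two apart. It assumes only the `"gpu %s: %s failed: %s"` message shape quoted in the review.

```go
package main

import "strings"

// extractEntityIDSketch is a hypothetical parser for messages shaped like
// "gpu <id>: <operation> failed: <error>". It returns "" when the message
// does not start with "gpu " or has no colon after the ID.
func extractEntityIDSketch(message string) string {
	const prefix = "gpu "
	if !strings.HasPrefix(message, prefix) {
		return ""
	}
	rest := strings.TrimPrefix(message, prefix)

	// The ID ends at the first colon; anything after it is the operation and error.
	idx := strings.Index(rest, ":")
	if idx < 0 {
		return ""
	}
	return strings.TrimSpace(rest[:idx])
}
```

Splitting on the first colon keeps the sketch independent of the operation name and error text, but it would break if an entity ID itself ever contained a colon. That coupling to the message format is the kind of mismatch a direct test is meant to surface.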
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 920783a9-bf79-4acd-8c9d-1a3b63e2c482
📒 Files selected for processing (5)
- internal/registry/registry.go
- third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component.go
- third_party/fleet-intelligence-sdk/components/accelerator/nvidia/nvml/component_test.go
- third_party/fleet-intelligence-sdk/pkg/machine-info/machine_info.go
- third_party/fleet-intelligence-sdk/pkg/machine-info/machine_info_test.go
Description
Checklist
Summary by CodeRabbit
New Features
Tests