Conversation
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughEnrichedIncident now stores DCGM numeric enums for ErrorCode, System, and Health; enrichment fills these typed fields and API-facing Severity/Error strings are derived on demand. Tests, component fixtures, OTLP conversion, and API types updated to use Changes
Sequence Diagram(s)(omitted) Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
…lity The backend DCGMIncident model expects integer values for code, system, and health fields, but EnrichedIncident was storing and serializing them as strings, causing json.Unmarshal to fail on the backend. - Change EnrichedIncident.ErrorCode to dcgm.HealthCheckErrorCode (uint) - Change EnrichedIncident.System to dcgm.HealthSystem (uint) - Replace EnrichedIncident.Severity with Health dcgm.HealthResult (uint) - EnrichIncidents/EnrichSwitchIncidents now store raw DCGM values - ToHealthStateIncident() converts back to strings for HealthState API - Add TestEnrichedIncidentJSON_SerializesIntegers to verify the contract - Update component tests to use DCGM constants instead of string literals Fixes GPUHEALTH-1721 Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
c3b7543 to
49d3889
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils_test.go`:
- Around line 361-377: The test currently re-asserts parsed["code"],
parsed["system"], and parsed["health"] with type assertions that can panic;
change each block to perform a single safe type assertion (e.g., val, ok :=
parsed["code"].(float64)) and fail fast (t.Fatalf or t.Errorf then return) if
!ok, then compare the typed val to the expected constant
(dcgm.DCGM_FR_VOLATILE_SBE_DETECTED_TS, dcgm.DCGM_HEALTH_WATCH_MEM,
dcgm.DCGM_HEALTH_RESULT_WARN) using a non-panicking comparison; apply the same
pattern for "system" and "health" in utils_test.go so no direct re-assertions on
parsed[...] occur after the ok check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4bb3eff1-6fac-4e0a-b399-8f817fb52999
📒 Files selected for processing (5)
third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils.gothird_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils_test.gothird_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/inforom/component_test.gothird_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/nvswitch/component_test.gothird_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/power/component_test.go
Aligns HealthStateIncident with HealthState which already uses Health as the field name. Updates the JSON tag from "severity" to "health" and updates all callers across the SDK and agent. Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
|
@jingxiang-z this looks good, I wonder if we could share a schema between agent and backend so we can prevent this type of mismatch in the future, a well defined protocol between agent and backend would be helpful in other cases too! |
Good question! I'm thinking about doing this in the external sdk. Have a pkg/component or pkg/types to define all types, backend just import this go package. |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
internal/exporter/collector/collector_test.go (1)
650-655: Assert the renamed incident field in this test.Lines 650-655 switch the fixture to
Health, but the assertions below only verifyEntityID. A regression that drops or zeroesHealthwould still pass. Please pin the new field here.Small test addition
require.Len(t, incidents, 1) assert.Equal(t, "GPU-1234", incidents[0].EntityID) + assert.Equal(t, apiv1.HealthStateTypeDegraded, incidents[0].Health) + assert.Equal(t, "DCGM_FR_CLOCK_THROTTLE_POWER", incidents[0].Error)Also applies to: 680-684
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/exporter/collector/collector_test.go` around lines 650 - 655, The test fixture was changed to set the incident's Health field but the assertions only check EntityID; update the test to assert the Health value on the apiv1.HealthStateIncident instances (e.g., verify incident.Health == apiv1.HealthStateTypeDegraded) so regressions that drop or zero Health fail; apply the same additional assertion for the other fixture around lines 680-684 where Health was set.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@third_party/fleet-intelligence-sdk/api/v1/types.go`:
- Around line 108-109: Restore compatibility by reintroducing the exported
Severity field on the HealthStateIncident type (same type as Health, e.g.
HealthStateType) with json:"severity,omitempty" while keeping the new Health
field json:"health,omitempty"; add MarshalJSON and UnmarshalJSON methods on
HealthStateIncident so incoming JSON with either "severity" or "health"
populates both fields and serialized output continues to include both keys (or
at least preserve the original "severity" key) to avoid breaking older clients
and compile-time consumers.
In
`@third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils.go`:
- Around line 235-240: The exported struct EnrichedIncident was mutated
(replacing string Severity with Health and changing field types to dcgm enums
ErrorCode/System/Health), which breaks existing Go and JSON consumers; revert
EnrichedIncident to its original public shape (keep existing field types and
JSON tags) and instead introduce a new versioned type (e.g., EnrichedIncidentV2)
or add explicit compatibility fields (e.g., ErrorCodeInt, SystemInt, HealthInt)
to carry numeric DCGM values; provide a conversion function (e.g., ToV2 or
FromV2) that maps between EnrichedIncident and the new representation so
internal code can use dcgm.HealthCheckErrorCode, dcgm.HealthSystem, and
dcgm.HealthResult without changing the original exported JSON/Go contract.
---
Nitpick comments:
In `@internal/exporter/collector/collector_test.go`:
- Around line 650-655: The test fixture was changed to set the incident's Health
field but the assertions only check EntityID; update the test to assert the
Health value on the apiv1.HealthStateIncident instances (e.g., verify
incident.Health == apiv1.HealthStateTypeDegraded) so regressions that drop or
zero Health fail; apply the same additional assertion for the other fixture
around lines 680-684 where Health was set.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b5f5e920-148e-4858-a248-5943b486ac99
📒 Files selected for processing (7)
internal/exporter/collector/collector_test.gointernal/exporter/converter/otlp.gointernal/exporter/converter/otlp_test.gothird_party/fleet-intelligence-sdk/api/v1/types.gothird_party/fleet-intelligence-sdk/api/v1/types_test.gothird_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils.gothird_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils_test.go
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
Description
Checklist
Summary by CodeRabbit