fix: dcgm incident integer types#148

Merged
jingxiang-z merged 4 commits into main from fix/dcgm-incident-integer-types
Mar 31, 2026

Conversation

@jingxiang-z
Collaborator

@jingxiang-z jingxiang-z commented Mar 31, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary by CodeRabbit

  • Refactor
    • Incident payloads now use a dedicated "health" field with typed numeric DCGM result/code values for GPU health reporting.
  • API
    • Incident JSON field formerly labeled "severity" is now "health" in API responses.
  • Tests
    • Test suites updated to match the new incident representation and numeric serialization.

@coderabbitai

coderabbitai Bot commented Mar 31, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dbb37adb-d024-4fc8-9937-35d50ba5b399

📥 Commits

Reviewing files that changed from the base of the PR and between 476ec6f and 1bd4deb.

📒 Files selected for processing (3)
  • internal/exporter/collector/collector_test.go
  • internal/exporter/converter/otlp_test.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils_test.go

📝 Walkthrough

EnrichedIncident now stores DCGM numeric enums for ErrorCode, System, and Health; enrichment fills these typed fields and API-facing Severity/Error strings are derived on demand. Tests, component fixtures, OTLP conversion, and API types updated to use Health and typed DCGM constants with numeric JSON serialization.
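
The "typed fields, strings derived on demand" idea can be sketched as follows. This is a minimal illustration, not the SDK's code: `healthResult` and `errorCode` are hypothetical stand-ins for `dcgm.HealthResult` and `dcgm.HealthCheckErrorCode`, the numeric value of the error code is a placeholder, and the health label strings are assumptions.

```go
package main

import "fmt"

// healthResult and errorCode are hypothetical stand-ins for
// dcgm.HealthResult and dcgm.HealthCheckErrorCode.
type healthResult uint
type errorCode uint

const (
	healthPass healthResult = 0
	healthWarn healthResult = 10
	healthFail healthResult = 20

	// Placeholder value for illustration, not the real DCGM constant.
	errClockThrottlePower errorCode = 1
)

// errorNames maps numeric codes to the symbolic names shown in API responses.
var errorNames = map[errorCode]string{
	errClockThrottlePower: "DCGM_FR_CLOCK_THROTTLE_POWER",
}

// enrichedIncident stores raw numeric values only.
type enrichedIncident struct {
	EntityID  string
	ErrorCode errorCode
	Health    healthResult
}

// healthStateIncident is the string-based, API-facing view.
type healthStateIncident struct {
	EntityID string
	Health   string
	Error    string
}

// healthString derives the API-facing label; the label names are assumptions.
func healthString(h healthResult) string {
	switch h {
	case healthPass:
		return "Healthy"
	case healthWarn:
		return "Degraded"
	case healthFail:
		return "Unhealthy"
	default:
		return fmt.Sprintf("Unknown(%d)", uint(h))
	}
}

// errorString looks up the symbolic name and falls back to the number.
func errorString(c errorCode) string {
	if name, ok := errorNames[c]; ok {
		return name
	}
	return fmt.Sprintf("UNKNOWN_ERROR_%d", uint(c))
}

// toHealthStateIncident converts numeric fields to strings only when asked.
func (inc enrichedIncident) toHealthStateIncident() healthStateIncident {
	return healthStateIncident{
		EntityID: inc.EntityID,
		Health:   healthString(inc.Health),
		Error:    errorString(inc.ErrorCode),
	}
}

func main() {
	inc := enrichedIncident{EntityID: "GPU-1234", ErrorCode: errClockThrottlePower, Health: healthWarn}
	fmt.Printf("%+v\n", inc.toHealthStateIncident())
}
```

Keeping the numeric values as the stored form means the wire format and the human-readable form cannot drift apart: the strings are always computed from the numbers.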

Changes

Cohort: Core utils & tests
File(s): third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils.go, third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils_test.go
Summary: EnrichedIncident field types changed: ErrorCode → dcgm.HealthCheckErrorCode, System → dcgm.HealthSystem, Severity replaced by Health dcgm.HealthResult. Enrich functions populate typed fields; ToHealthStateIncident() derives API-facing strings on demand. Tests updated to use dcgm constants and assert numeric JSON for code, system, health.

Cohort: API types & tests
File(s): third_party/fleet-intelligence-sdk/api/v1/types.go, third_party/fleet-intelligence-sdk/api/v1/types_test.go
Summary: HealthStateIncident field renamed Severity → Health (JSON key severity → health); tests updated to expect "health" in marshaled output.

Cohort: Exporter conversion & tests
File(s): internal/exporter/converter/otlp.go, internal/exporter/converter/otlp_test.go
Summary: OTLP conversion now derives the incident severity attribute from inc.Health instead of inc.Severity. Test fixtures updated accordingly.

Cohort: Collector & component tests
File(s): internal/exporter/collector/collector_test.go, third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/inforom/component_test.go, .../nvswitch/component_test.go, .../power/component_test.go
Summary: Test fixtures switched from string ErrorCode/System and Severity to typed dcgm constants and Health. Adjusted assertions where serialized legacy values are now numeric.

Cohort: Misc test updates
File(s): ...
Summary: Other tests updated across the repo to replace Severity/string codes with Health/typed DCGM constants and to expect numeric serialization for DCGM fields.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • rsampaio
  • moyaz-nvidia

Poem

🐇 I nibble bytes and hop through code,

Enums now march in tidy numeric mode.
Error codes and health in integer dress,
Conversions on demand, no stringy mess.
A rabbit applauds this tidy progress.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 40.00%, which is insufficient. The required threshold is 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name: Title check
Status: ✅ Passed
Explanation: The title 'Fix/dcgm incident integer types' accurately reflects the main changes in the pull request, which refactor DCGM incident data structures to use typed integer constants (dcgm.HealthCheckErrorCode, dcgm.HealthSystem, dcgm.HealthResult) instead of string representations.

Check name: Description Check
Status: ✅ Passed
Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.

…lity

The backend DCGMIncident model expects integer values for code, system,
and health fields, but EnrichedIncident was storing and serializing them
as strings, causing json.Unmarshal to fail on the backend.

- Change EnrichedIncident.ErrorCode to dcgm.HealthCheckErrorCode (uint)
- Change EnrichedIncident.System to dcgm.HealthSystem (uint)
- Replace EnrichedIncident.Severity with Health dcgm.HealthResult (uint)
- EnrichIncidents/EnrichSwitchIncidents now store raw DCGM values
- ToHealthStateIncident() converts back to strings for HealthState API
- Add TestEnrichedIncidentJSON_SerializesIntegers to verify the contract
- Update component tests to use DCGM constants instead of string literals

Fixes GPUHEALTH-1721

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
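
The failure described in this commit message can be reproduced with a small sketch. All three struct types here are hypothetical stand-ins (the real backend `DCGMIncident` model is not shown in this thread), the old string payload's JSON keys are assumed to match the backend's, and the numeric values 3, 2, and 10 are placeholders rather than real DCGM constants.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// backendIncident is a hypothetical stand-in for the backend model,
// which expects integers for code, system, and health.
type backendIncident struct {
	Code   uint `json:"code"`
	System uint `json:"system"`
	Health uint `json:"health"`
}

// stringIncident mimics the old agent-side shape (strings on the wire).
type stringIncident struct {
	Code   string `json:"code"`
	System string `json:"system"`
	Health string `json:"health"`
}

// Named uint types, analogous to the typed DCGM enums the fix introduces.
type healthCheckErrorCode uint
type healthSystem uint
type healthResult uint

// typedIncident mimics the new agent-side shape (integers on the wire).
type typedIncident struct {
	Code   healthCheckErrorCode `json:"code"`
	System healthSystem         `json:"system"`
	Health healthResult         `json:"health"`
}

// roundTrip marshals an agent payload and decodes it as the backend would.
func roundTrip(payload any) (backendIncident, error) {
	raw, err := json.Marshal(payload)
	if err != nil {
		return backendIncident{}, err
	}
	var out backendIncident
	err = json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	// Old shape: the backend rejects a JSON string where it expects a number.
	_, err := roundTrip(stringIncident{Code: "DCGM_FR_VOLATILE_SBE_DETECTED_TS", System: "MEM", Health: "WARN"})
	fmt.Println("string payload error:", err)

	// New shape: named uint types serialize as plain JSON numbers.
	got, err := roundTrip(typedIncident{Code: 3, System: 2, Health: 10})
	fmt.Println("typed payload:", got, "error:", err)
}
```

Because Go's `encoding/json` encodes a named `uint` type as a bare number, switching the field types is enough to change the wire format; no custom marshaler is needed.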
@jingxiang-z jingxiang-z force-pushed the fix/dcgm-incident-integer-types branch from c3b7543 to 49d3889 Compare March 31, 2026 18:09

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils_test.go`:
- Around line 361-377: The test currently re-asserts parsed["code"],
parsed["system"], and parsed["health"] with type assertions that can panic;
change each block to perform a single safe type assertion (e.g., val, ok :=
parsed["code"].(float64)) and fail fast (t.Fatalf or t.Errorf then return) if
!ok, then compare the typed val to the expected constant
(dcgm.DCGM_FR_VOLATILE_SBE_DETECTED_TS, dcgm.DCGM_HEALTH_WATCH_MEM,
dcgm.DCGM_HEALTH_RESULT_WARN) using a non-panicking comparison; apply the same
pattern for "system" and "health" in utils_test.go so no direct re-assertions on
parsed[...] occur after the ok check.
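
The safe-assertion pattern this comment asks for might look like the sketch below. `numericField` is a hypothetical helper, not a function from the PR; it relies only on the fact that `encoding/json` decodes JSON numbers into `float64` when the target is `map[string]any`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// numericField reads parsed[key] with a single comma-ok type assertion,
// returning an error instead of panicking when the value is missing or
// is not a JSON number.
func numericField(parsed map[string]any, key string) (float64, error) {
	raw, present := parsed[key]
	if !present {
		return 0, fmt.Errorf("field %q is missing", key)
	}
	val, ok := raw.(float64)
	if !ok {
		return 0, fmt.Errorf("field %q has type %T, want a JSON number", key, raw)
	}
	return val, nil
}

func main() {
	var parsed map[string]any
	// "system" is deliberately a string and "health" is absent.
	if err := json.Unmarshal([]byte(`{"code": 3, "system": "mem"}`), &parsed); err != nil {
		panic(err)
	}
	for _, key := range []string{"code", "system", "health"} {
		val, err := numericField(parsed, key)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s = %v\n", key, val)
	}
}
```

In a test, the error branch would typically call `t.Fatalf`, after which the typed value can be compared to the expected constant without a second assertion on `parsed[...]`.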

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4bb3eff1-6fac-4e0a-b399-8f817fb52999

📥 Commits

Reviewing files that changed from the base of the PR and between 4b2faa4 and 49d3889.

📒 Files selected for processing (5)
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils_test.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/inforom/component_test.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/nvswitch/component_test.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/power/component_test.go

Aligns HealthStateIncident with HealthState which already uses Health
as the field name. Updates the JSON tag from "severity" to "health"
and updates all callers across the SDK and agent.

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
@NVIDIA NVIDIA deleted a comment from chatgpt-codex-connector Bot Mar 31, 2026
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
@jingxiang-z jingxiang-z self-assigned this Mar 31, 2026
@rsampaio
Collaborator

@jingxiang-z this looks good, I wonder if we could share a schema between agent and backend so we can prevent this type of mismatch in the future, a well defined protocol between agent and backend would be helpful in other cases too!

@jingxiang-z
Collaborator Author

> @jingxiang-z this looks good, I wonder if we could share a schema between agent and backend so we can prevent this type of mismatch in the future, a well defined protocol between agent and backend would be helpful in other cases too!

Good question! I'm thinking about doing this in the external SDK: have a pkg/component or pkg/types package define all the types, so the backend can just import that Go package.
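
One way to picture the shared-types idea is the single-file sketch below, where one struct definition serves both the encoding side (agent) and the decoding side (backend). This is only an illustration under assumptions: the `Incident` type, its fields, and both helper functions are hypothetical, and in a real layout the type would live in the shared package the comment mentions rather than in `package main`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Incident is a hypothetical shared wire type. With a shared package,
// the agent and the backend would both import this one definition.
type Incident struct {
	Code   uint `json:"code"`
	System uint `json:"system"`
	Health uint `json:"health"`
}

// encodeIncident is what the agent side would call before sending.
func encodeIncident(inc Incident) ([]byte, error) {
	return json.Marshal(inc)
}

// decodeIncident is what the backend side would call on receipt.
// DisallowUnknownFields turns a renamed or misspelled key into an error
// instead of a silently zeroed field.
func decodeIncident(raw []byte) (Incident, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	var inc Incident
	if err := dec.Decode(&inc); err != nil {
		return Incident{}, err
	}
	return inc, nil
}

func main() {
	raw, err := encodeIncident(Incident{Code: 3, System: 2, Health: 10})
	if err != nil {
		panic(err)
	}
	inc, err := decodeIncident(raw)
	fmt.Println(inc, err)

	// A payload that still uses a legacy key is rejected explicitly.
	_, err = decodeIncident([]byte(`{"code": 3, "system": 2, "severity": 10}`))
	fmt.Println("legacy key error:", err)
}
```

With a single definition, a type change such as string to integer becomes a compile-time change on both sides instead of a runtime decoding failure.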


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
internal/exporter/collector/collector_test.go (1)

650-655: Assert the renamed incident field in this test.

Lines 650-655 switch the fixture to Health, but the assertions below only verify EntityID. A regression that drops or zeroes Health would still pass. Please pin the new field here.

Small test addition
 	require.Len(t, incidents, 1)
 	assert.Equal(t, "GPU-1234", incidents[0].EntityID)
+	assert.Equal(t, apiv1.HealthStateTypeDegraded, incidents[0].Health)
+	assert.Equal(t, "DCGM_FR_CLOCK_THROTTLE_POWER", incidents[0].Error)

Also applies to: 680-684

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/exporter/collector/collector_test.go` around lines 650 - 655, The
test fixture was changed to set the incident's Health field but the assertions
only check EntityID; update the test to assert the Health value on the
apiv1.HealthStateIncident instances (e.g., verify incident.Health ==
apiv1.HealthStateTypeDegraded) so regressions that drop or zero Health fail;
apply the same additional assertion for the other fixture around lines 680-684
where Health was set.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@third_party/fleet-intelligence-sdk/api/v1/types.go`:
- Around line 108-109: Restore compatibility by reintroducing the exported
Severity field on the HealthStateIncident type (same type as Health, e.g.
HealthStateType) with json:"severity,omitempty" while keeping the new Health
field json:"health,omitempty"; add MarshalJSON and UnmarshalJSON methods on
HealthStateIncident so incoming JSON with either "severity" or "health"
populates both fields and serialized output continues to include both keys (or
at least preserve the original "severity" key) to avoid breaking older clients
and compile-time consumers.

In
`@third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils.go`:
- Around line 235-240: The exported struct EnrichedIncident was mutated
(replacing string Severity with Health and changing field types to dcgm enums
ErrorCode/System/Health), which breaks existing Go and JSON consumers; revert
EnrichedIncident to its original public shape (keep existing field types and
JSON tags) and instead introduce a new versioned type (e.g., EnrichedIncidentV2)
or add explicit compatibility fields (e.g., ErrorCodeInt, SystemInt, HealthInt)
to carry numeric DCGM values; provide a conversion function (e.g., ToV2 or
FromV2) that maps between EnrichedIncident and the new representation so
internal code can use dcgm.HealthCheckErrorCode, dcgm.HealthSystem, and
dcgm.HealthResult without changing the original exported JSON/Go contract.
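
For reference, the backward-compatibility suggestion for `HealthStateIncident` could be sketched roughly as below. This is an illustration of the reviewer bot's idea, not code from the PR, and it is a simplified variant: it keeps a single Go field and handles the legacy `"severity"` key only at the JSON boundary. The `incidentWire` helper type is hypothetical, and the real struct has more fields than shown.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HealthStateIncident is reduced to the one field relevant to the rename.
type HealthStateIncident struct {
	Health string
}

// incidentWire is a hypothetical helper carrying both JSON keys.
type incidentWire struct {
	Health   string `json:"health,omitempty"`
	Severity string `json:"severity,omitempty"` // legacy key
}

// MarshalJSON emits both keys so older readers of "severity" keep working.
func (i HealthStateIncident) MarshalJSON() ([]byte, error) {
	return json.Marshal(incidentWire{Health: i.Health, Severity: i.Health})
}

// UnmarshalJSON accepts either key, preferring "health" when both are set.
func (i *HealthStateIncident) UnmarshalJSON(data []byte) error {
	var w incidentWire
	if err := json.Unmarshal(data, &w); err != nil {
		return err
	}
	i.Health = w.Health
	if i.Health == "" {
		i.Health = w.Severity
	}
	return nil
}

func main() {
	var legacy HealthStateIncident
	if err := json.Unmarshal([]byte(`{"severity":"Degraded"}`), &legacy); err != nil {
		panic(err)
	}
	fmt.Println("decoded from legacy key:", legacy.Health)

	out, err := json.Marshal(HealthStateIncident{Health: "Degraded"})
	if err != nil {
		panic(err)
	}
	fmt.Println("encoded:", string(out))
}
```

The trade-off is a permanently duplicated key on the wire; a versioned type, as the second inline comment proposes, avoids that at the cost of a conversion step.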


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b5f5e920-148e-4858-a248-5943b486ac99

📥 Commits

Reviewing files that changed from the base of the PR and between 49d3889 and 476ec6f.

📒 Files selected for processing (7)
  • internal/exporter/collector/collector_test.go
  • internal/exporter/converter/otlp.go
  • internal/exporter/converter/otlp_test.go
  • third_party/fleet-intelligence-sdk/api/v1/types.go
  • third_party/fleet-intelligence-sdk/api/v1/types_test.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/common/utils_test.go

Comment thread third_party/fleet-intelligence-sdk/api/v1/types.go
@jingxiang-z jingxiang-z changed the title from "Fix/dcgm incident integer types" to "fix: dcgm incident integer types" Mar 31, 2026
@jingxiang-z jingxiang-z merged commit 41e9c3b into main Mar 31, 2026
9 checks passed
@jingxiang-z jingxiang-z deleted the fix/dcgm-incident-integer-types branch March 31, 2026 18:59
jingxiang-z added a commit that referenced this pull request Mar 31, 2026
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
jingxiang-z added a commit that referenced this pull request Mar 31, 2026
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
2 participants