feat: add dcgm version to machine info #143

Merged
jingxiang-z merged 1 commit into main from feat/add-dcgm-version-machine-info
Mar 24, 2026

@jingxiang-z (Collaborator) commented Mar 24, 2026

Summary

  • add DCGM HostEngine version detection to machine info as a best-effort field
  • reuse shared DCGM version lookup from precheck and surface the field in table and CSV output
  • add tests covering machine info, precheck, CSV, and OTLP export paths

Behavior Changes

  • machine info exports now include dcgmVersion when DCGM HostEngine is reachable
  • machine info collection does not fail if the DCGM version cannot be determined
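
The best-effort behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the trimmed `MachineInfo` struct, the `versionDetector` type, the `collect` function, both stubs, and the sample version string are hypothetical names and values introduced only for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// MachineInfo is a trimmed-down, hypothetical stand-in for the real struct.
type MachineInfo struct {
	Hostname    string
	DCGMVersion string // stays empty when the HostEngine cannot be queried
}

// versionDetector is a hypothetical seam so the DCGM lookup can be stubbed.
type versionDetector func() (string, error)

// collect fills DCGMVersion on a best-effort basis: a detection error
// leaves the field empty and never fails the collection as a whole.
func collect(hostname string, detect versionDetector) MachineInfo {
	info := MachineInfo{Hostname: hostname}
	if v, err := detect(); err == nil {
		info.DCGMVersion = v
	}
	return info
}

// stubReachable simulates a HostEngine that answers (made-up version).
func stubReachable() (string, error) { return "4.2.0", nil }

// stubUnreachable simulates a HostEngine that cannot be contacted.
func stubUnreachable() (string, error) {
	return "", errors.New("hostengine unreachable")
}

func main() {
	up := collect("node-a", stubReachable)
	down := collect("node-b", stubUnreachable)
	fmt.Printf("reachable: %q, unreachable: %q\n", up.DCGMVersion, down.DCGMVersion)
}
```

Passing the detector in as a function is one way to make the error-tolerance path testable without a running HostEngine.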

Testing

  • go test ./internal/dcgmversion ./internal/machineinfo ./internal/precheck ./internal/exporter/converter

[#1490]

Summary by CodeRabbit

  • New Features
    • System now detects and reports DCGM HostEngine version in machine information
    • DCGM version automatically included in CSV output and OTLP telemetry exports
    • Flexible DCGM connection configuration via environment variables with intelligent defaults

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

coderabbitai Bot commented Mar 24, 2026

Walkthrough

This change introduces a new dcgmversion package to detect DCGM HostEngine version, extracts and refactors existing version detection logic from the precheck module, integrates the detected version into machine info collection, and exports DCGM version metadata through CSV and OTLP converter outputs.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **New DCGM Version Detection Package**: `internal/dcgmversion/dcgmversion.go`, `internal/dcgmversion/dcgmversion_test.go` | Introduces DetectHostengineVersion() function to query DCGM HostEngine version via environment-configured connection parameters (DCGM_URL, DCGM_URL_IS_UNIX_SOCKET). Parses version from semicolon-delimited key:value format. Includes comprehensive unit tests for initialization parameter resolution and version extraction logic. |
| **Machine Info Integration**: `internal/machineinfo/machineinfo.go`, `internal/machineinfo/machineinfo_test.go` | Adds DCGMVersion field to MachineInfo struct and invokes dcgmversion.DetectHostengineVersion() during machine info collection. Updates table rendering to include DCGM version display. Test suite covers nominal detection, error tolerance (best-effort behavior), and struct validation. |
| **Converter Exports**: `internal/exporter/converter/csv.go`, `internal/exporter/converter/csv_test.go`, `internal/exporter/converter/otlp_test.go` | Extends machine info CSV export with DCGM Version row and adds corresponding assertions. Updates OTLP converter test to include dcgmVersion resource attribute validation. |
| **Precheck Refactoring**: `internal/precheck/precheck.go`, `internal/precheck/precheck_test.go` | Removes inline DCGM initialization and version extraction logic; consolidates into single detectDCGMVersion() call. Simplifies detectDCGM() from multi-step initialization flow to direct invocation of internal/dcgmversion module. Updates test stubs and assertions accordingly. |
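
The "semicolon-delimited key:value" parsing mentioned for the new package could look something like the sketch below. It is an assumption-laden illustration: the function name `extractVersion`, the `version` key, the case-insensitive match, and the sample input are all hypothetical, and the real build-info string may use different keys.

```go
package main

import (
	"fmt"
	"strings"
)

// extractVersion scans a semicolon-delimited list of key:value pairs and
// returns the value stored under the "version" key, if present and non-empty.
// The key name and the case-insensitive comparison are assumptions.
func extractVersion(raw string) (string, bool) {
	for _, pair := range strings.Split(raw, ";") {
		key, value, found := strings.Cut(pair, ":")
		if !found {
			continue // skip fragments that are not key:value pairs
		}
		if strings.EqualFold(strings.TrimSpace(key), "version") {
			v := strings.TrimSpace(value)
			return v, v != ""
		}
	}
	return "", false
}

func main() {
	// Made-up sample input for illustration only.
	raw := "version:4.2.0;arch:x86_64;buildtype:Release"
	v, ok := extractVersion(raw)
	fmt.Println(v, ok)
}
```

Returning a boolean alongside the string lets the caller tell "no version key" apart from a parsed value, which fits the best-effort handling described in the summary.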

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant MI as MachineInfo<br/>(machineinfo)
    participant DV as DCGMVersion<br/>(dcgmversion)
    participant DCGM as DCGM<br/>HostEngine
    participant Export as Exporters<br/>(CSV/OTLP)

    App->>MI: GetMachineInfo()
    MI->>DV: DetectHostengineVersion()
    DV->>DV: Resolve init params from env
    DV->>DCGM: Initialize connection
    DV->>DCGM: Query hostengine version
    DCGM-->>DV: RawBuildInfoString
    DV->>DV: Extract version from key:value pairs
    DV-->>MI: version string
    MI->>MI: Populate DCGMVersion field
    MI-->>App: MachineInfo with DCGMVersion
    App->>Export: Export machine info
    Export->>Export: Include DCGMVersion in output
    Export-->>App: CSV/OTLP with version metadata
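
The "Resolve init params from env" step in the diagram might be sketched as below. Only the variable names DCGM_URL and DCGM_URL_IS_UNIX_SOCKET come from this PR; the default address `localhost:5555`, the accepted boolean spellings, and the helper names `resolveInitParams` and `mapEnv` are assumptions made for this example, not the merged implementation.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// initParams holds the two values handed to the DCGM standalone init call.
type initParams struct {
	address      string
	isUnixSocket string // "0" or "1"
}

// defaultAddress is an assumed fallback, not taken from the PR.
const defaultAddress = "localhost:5555"

// resolveInitParams reads the connection settings through getenv and falls
// back to defaults when a variable is unset or unparsable.
func resolveInitParams(getenv func(string) string) initParams {
	p := initParams{address: defaultAddress, isUnixSocket: "0"}
	if addr := strings.TrimSpace(getenv("DCGM_URL")); addr != "" {
		p.address = addr
	}
	if raw := strings.TrimSpace(getenv("DCGM_URL_IS_UNIX_SOCKET")); raw != "" {
		if isSocket, err := strconv.ParseBool(raw); err == nil && isSocket {
			p.isUnixSocket = "1"
		}
	}
	return p
}

// mapEnv adapts a map to the getenv signature so the resolver can be tested
// without touching the process environment.
func mapEnv(m map[string]string) func(string) string {
	return func(key string) string { return m[key] }
}

func main() {
	fmt.Printf("from process env: %+v\n", resolveInitParams(os.Getenv))
	fake := mapEnv(map[string]string{
		"DCGM_URL":                "/tmp/dcgm.sock",
		"DCGM_URL_IS_UNIX_SOCKET": "true",
	})
	fmt.Printf("from fake env:    %+v\n", resolveInitParams(fake))
}
```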

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A version detected with care,
Through DCGM's data we share,
Extracted and parsed with delight,
CSV and OTLP shining bright,
Refactored with hop and with flair! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: add dcgm version to machine info' clearly summarizes the main change: adding DCGM version detection and inclusion in machine info output. |


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: db40a684e0

Comment thread internal/precheck/precheck.go

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (3)
internal/exporter/converter/csv.go (1)

261-290: Extract the shared machine-info row list.

writeMachineInfoCSV() still hand-builds the same machine-info rows as (*MachineInfo).RenderTable(), and the new DCGM Version row already landed in a different position here than in the table renderer. A shared row builder would keep the CSV and table outputs from drifting every time a field is added.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/exporter/converter/csv.go` around lines 261 - 290, The CSV writer
duplicates the machine-info row construction from the MachineInfo.RenderTable
path and has drifted (e.g., "DCGM Version" ordering); refactor by extracting a
single helper like buildMachineInfoRows(machineInfo *collector.MachineInfo)
[][]string (or a method on MachineInfo) and have both
csvConverter.writeMachineInfoCSV and (*MachineInfo).RenderTable call that helper
to generate the ordered []records; update writeMachineInfoCSV to use that helper
(removing the inline row list) so CSV and table output stay consistent whenever
fields are added or reordered.
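
One possible shape for the shared row builder suggested above is sketched here. It is illustrative only: the `Rows` method, the `renderTable` and `renderCSV` helpers, and the `Hostname` and `DriverVersion` fields are hypothetical and do not mirror the repository's actual `MachineInfo` or converter types.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// MachineInfo is a reduced, hypothetical version of the real struct.
type MachineInfo struct {
	Hostname      string
	DriverVersion string
	DCGMVersion   string
}

// Rows is the single source of truth for the label/value order, so the
// table and CSV renderers cannot drift apart when a field is added.
func (m MachineInfo) Rows() [][]string {
	return [][]string{
		{"Hostname", m.Hostname},
		{"Driver Version", m.DriverVersion},
		{"DCGM Version", m.DCGMVersion},
	}
}

// renderTable prints the shared rows as aligned plain text.
func renderTable(m MachineInfo) string {
	var b strings.Builder
	for _, row := range m.Rows() {
		fmt.Fprintf(&b, "%-16s %s\n", row[0], row[1])
	}
	return b.String()
}

// renderCSV writes the same rows through encoding/csv.
func renderCSV(m MachineInfo) (string, error) {
	var b strings.Builder
	w := csv.NewWriter(&b)
	if err := w.WriteAll(m.Rows()); err != nil { // WriteAll also flushes
		return "", err
	}
	return b.String(), nil
}

func main() {
	m := MachineInfo{Hostname: "node-a", DriverVersion: "570.00", DCGMVersion: "4.2.0"}
	fmt.Print(renderTable(m))
	out, err := renderCSV(m)
	if err != nil {
		fmt.Println("csv error:", err)
		return
	}
	fmt.Print(out)
}
```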
internal/machineinfo/machineinfo.go (1)

125-132: Reuse a single DCGM lookup per collection.

GetMachineInfo() now does its own HostEngine version probe here, but CollectInput() in internal/precheck/precheck.go still calls detectDCGM() afterwards. That means one precheck run can init/query DCGM twice, and MachineInfo.DCGMVersion can disagree with Input.DCGMVersion if one probe fails transiently. Consider letting callers pass a known version or skip this lookup when they already did it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/machineinfo/machineinfo.go` around lines 125 - 132, GetMachineInfo()
currently calls getDCGMVersion() internally causing duplicate DCGM probes when
CollectInput() also calls detectDCGM(); change GetMachineInfo() to accept an
optional dcgmVersion parameter (or a boolean to skip probing) and use that value
when provided, otherwise fall back to calling getDCGMVersion(); update callers
like CollectInput()/precheck.go to pass the detected DCGM version (from
detectDCGM()) into GetMachineInfo() so MachineInfo.DCGMVersion and
Input.DCGMVersion come from the same single lookup and transient probe failures
cannot produce divergent values.
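
The "pass a known version or skip the lookup" idea could be expressed with a functional option, as in the sketch below. This is one possible design under stated assumptions, not the project's API: `Option`, `WithDCGMVersion`, `probeDCGM`, the `probeCount` counter, and the sample version are hypothetical names and values used only to show that a single probe can serve both callers.

```go
package main

import "fmt"

// MachineInfo is reduced to the one field relevant here.
type MachineInfo struct{ DCGMVersion string }

type config struct {
	dcgmVersion string
	haveVersion bool
}

// Option customizes GetMachineInfo (hypothetical API).
type Option func(*config)

// WithDCGMVersion supplies an already-detected version so that no second
// probe is issued against the HostEngine.
func WithDCGMVersion(v string) Option {
	return func(c *config) { c.dcgmVersion, c.haveVersion = v, true }
}

// probeCount counts simulated HostEngine probes for demonstration.
var probeCount int

// probeDCGM stands in for the real DCGM lookup (made-up version).
func probeDCGM() (string, error) {
	probeCount++
	return "4.2.0", nil
}

// GetMachineInfo probes DCGM only when the caller did not pass a version.
func GetMachineInfo(opts ...Option) MachineInfo {
	var c config
	for _, opt := range opts {
		opt(&c)
	}
	if !c.haveVersion {
		if v, err := probeDCGM(); err == nil {
			c.dcgmVersion = v
		}
	}
	return MachineInfo{DCGMVersion: c.dcgmVersion}
}

func main() {
	v, _ := probeDCGM() // the precheck path performs the only lookup
	info := GetMachineInfo(WithDCGMVersion(v))
	fmt.Println(info.DCGMVersion, "probes:", probeCount)
}
```

A variadic option keeps existing zero-argument callers working, which is why it is used here instead of a new required parameter.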
internal/dcgmversion/dcgmversion.go (1)

33-42: Add operation context to returned errors.

Line 35 and Line 41 return bare upstream errors, which makes caller logs less actionable. Wrap with context so failures are distinguishable (init vs version info query).

Suggested diff
 import (
+	"fmt"
 	"os"
 	"strconv"
 	"strings"

 	godcgm "github.com/NVIDIA/go-dcgm/pkg/dcgm"
 )
@@
 	cleanup, err := godcgm.Init(godcgm.Standalone, initParams.address, initParams.isUnixSocket)
 	if err != nil {
-		return "", err
+		return "", fmt.Errorf("initialize DCGM standalone connection: %w", err)
 	}
@@
 	versionInfo, err := godcgm.GetHostengineVersionInfo()
 	if err != nil {
-		return "", err
+		return "", fmt.Errorf("get DCGM hostengine version info: %w", err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/dcgmversion/dcgmversion.go` around lines 33 - 42, Wrap the bare
errors returned from godcgm.Init and godcgm.GetHostengineVersionInfo with
operation-specific context before returning so callers can distinguish init vs
version-query failures; for example, replace the direct returns after
godcgm.Init(...) and godcgm.GetHostengineVersionInfo() with wrapped errors like
fmt.Errorf("failed to init DCGM: %w", err) and fmt.Errorf("failed to query
hostengine version info: %w", err) respectively, referencing the godcgm.Init and
godcgm.GetHostengineVersionInfo calls to locate the spots to change.
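
From the caller's side, wrapping with `%w` adds the operation context while keeping the upstream error inspectable. The sketch below uses stand-ins only: `errConnRefused`, `initDCGM`, `detectVersion`, and `isConnRefused` are hypothetical and replace the real go-dcgm calls so the example runs without a HostEngine.

```go
package main

import (
	"errors"
	"fmt"
)

// errConnRefused stands in for whatever error the upstream init call returns.
var errConnRefused = errors.New("connection refused")

// initDCGM is a stub that always fails, replacing the real init call.
func initDCGM() error { return errConnRefused }

// detectVersion wraps the upstream error with the failing operation's name.
func detectVersion() (string, error) {
	if err := initDCGM(); err != nil {
		return "", fmt.Errorf("initialize DCGM standalone connection: %w", err)
	}
	return "4.2.0", nil // made-up version, unreachable with this stub
}

// isConnRefused shows that callers can still match the wrapped cause.
func isConnRefused(err error) bool { return errors.Is(err, errConnRefused) }

func main() {
	_, err := detectVersion()
	fmt.Println("error:", err)
	fmt.Println("caused by refused connection:", isConnRefused(err))
}
```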

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c731d81f-385b-4b2a-8ce4-d73ab091c003

📥 Commits

Reviewing files that changed from the base of the PR and between c07b91f and db40a68.

📒 Files selected for processing (9)
  • internal/dcgmversion/dcgmversion.go
  • internal/dcgmversion/dcgmversion_test.go
  • internal/exporter/converter/csv.go
  • internal/exporter/converter/csv_test.go
  • internal/exporter/converter/otlp_test.go
  • internal/machineinfo/machineinfo.go
  • internal/machineinfo/machineinfo_test.go
  • internal/precheck/precheck.go
  • internal/precheck/precheck_test.go

@jingxiang-z jingxiang-z merged commit 7c5fc96 into main Mar 24, 2026
9 checks passed
@jingxiang-z jingxiang-z deleted the feat/add-dcgm-version-machine-info branch March 24, 2026 22:06

2 participants