feat: add dcgm version to machine info#143
Conversation
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
📝 WalkthroughWalkthroughThis change introduces a new Changes
Sequence DiagramsequenceDiagram
participant App as Application
participant MI as MachineInfo<br/>(machineinfo)
participant DV as DCGMVersion<br/>(dcgmversion)
participant DCGM as DCGM<br/>HostEngine
participant Export as Exporters<br/>(CSV/OTLP)
App->>MI: GetMachineInfo()
MI->>DV: DetectHostengineVersion()
DV->>DV: Resolve init params from env
DV->>DCGM: Initialize connection
DV->>DCGM: Query hostengine version
DCGM-->>DV: RawBuildInfoString
DV->>DV: Extract version from key:value pairs
DV-->>MI: version string
MI->>MI: Populate DCGMVersion field
MI-->>App: MachineInfo with DCGMVersion
App->>Export: Export machine info
Export->>Export: Include DCGMVersion in output
Export-->>App: CSV/OTLP with version metadata
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: db40a684e0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
🧹 Nitpick comments (3)
internal/exporter/converter/csv.go (1)
261-290: Extract the shared machine-info row list.
writeMachineInfoCSV()still hand-builds the same machine-info rows as(*MachineInfo).RenderTable(), and the newDCGM Versionrow already landed in a different position here than in the table renderer. A shared row builder would keep the CSV and table outputs from drifting every time a field is added.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/exporter/converter/csv.go` around lines 261 - 290, The CSV writer duplicates the machine-info row construction from the MachineInfo.RenderTable path and has drifted (e.g., "DCGM Version" ordering); refactor by extracting a single helper like buildMachineInfoRows(machineInfo *collector.MachineInfo) [][]string (or a method on MachineInfo) and have both csvConverter.writeMachineInfoCSV and (*MachineInfo).RenderTable call that helper to generate the ordered []records; update writeMachineInfoCSV to use that helper (removing the inline row list) so CSV and table output stay consistent whenever fields are added or reordered.internal/machineinfo/machineinfo.go (1)
125-132: Reuse a single DCGM lookup per collection.
GetMachineInfo()now does its own HostEngine version probe here, butCollectInput()ininternal/precheck/precheck.gostill callsdetectDCGM()afterwards. That means one precheck run can init/query DCGM twice, andMachineInfo.DCGMVersioncan disagree withInput.DCGMVersionif one probe fails transiently. Consider letting callers pass a known version or skip this lookup when they already did it.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/machineinfo/machineinfo.go` around lines 125 - 132, GetMachineInfo() currently calls getDCGMVersion() internally causing duplicate DCGM probes when CollectInput() also calls detectDCGM(); change GetMachineInfo() to accept an optional dcgmVersion parameter (or a boolean to skip probing) and use that value when provided, otherwise fall back to calling getDCGMVersion(); update callers like CollectInput()/precheck.go to pass the detected DCGM version (from detectDCGM()) into GetMachineInfo() so MachineInfo.DCGMVersion and Input.DCGMVersion come from the same single lookup and transient probe failures cannot produce divergent values.internal/dcgmversion/dcgmversion.go (1)
33-42: Add operation context to returned errors.Line 35 and Line 41 return bare upstream errors, which makes caller logs less actionable. Wrap with context so failures are distinguishable (
initvsversion info query).Suggested diff
import ( + "fmt" "os" "strconv" "strings" godcgm "github.com/NVIDIA/go-dcgm/pkg/dcgm" ) @@ cleanup, err := godcgm.Init(godcgm.Standalone, initParams.address, initParams.isUnixSocket) if err != nil { - return "", err + return "", fmt.Errorf("initialize DCGM standalone connection: %w", err) } @@ versionInfo, err := godcgm.GetHostengineVersionInfo() if err != nil { - return "", err + return "", fmt.Errorf("get DCGM hostengine version info: %w", err) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/dcgmversion/dcgmversion.go` around lines 33 - 42, Wrap the bare errors returned from godcgm.Init and godcgm.GetHostengineVersionInfo with operation-specific context before returning so callers can distinguish init vs version-query failures; for example, replace the direct returns after godcgm.Init(...) and godcgm.GetHostengineVersionInfo() with wrapped errors like fmt.Errorf("failed to init DCGM: %w", err) and fmt.Errorf("failed to query hostengine version info: %w", err) respectively, referencing the godcgm.Init and godcgm.GetHostengineVersionInfo calls to locate the spots to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/dcgmversion/dcgmversion.go`:
- Around line 33-42: Wrap the bare errors returned from godcgm.Init and
godcgm.GetHostengineVersionInfo with operation-specific context before returning
so callers can distinguish init vs version-query failures; for example, replace
the direct returns after godcgm.Init(...) and godcgm.GetHostengineVersionInfo()
with wrapped errors like fmt.Errorf("failed to init DCGM: %w", err) and
fmt.Errorf("failed to query hostengine version info: %w", err) respectively,
referencing the godcgm.Init and godcgm.GetHostengineVersionInfo calls to locate
the spots to change.
In `@internal/exporter/converter/csv.go`:
- Around line 261-290: The CSV writer duplicates the machine-info row
construction from the MachineInfo.RenderTable path and has drifted (e.g., "DCGM
Version" ordering); refactor by extracting a single helper like
buildMachineInfoRows(machineInfo *collector.MachineInfo) [][]string (or a method
on MachineInfo) and have both csvConverter.writeMachineInfoCSV and
(*MachineInfo).RenderTable call that helper to generate the ordered []records;
update writeMachineInfoCSV to use that helper (removing the inline row list) so
CSV and table output stay consistent whenever fields are added or reordered.
In `@internal/machineinfo/machineinfo.go`:
- Around line 125-132: GetMachineInfo() currently calls getDCGMVersion()
internally causing duplicate DCGM probes when CollectInput() also calls
detectDCGM(); change GetMachineInfo() to accept an optional dcgmVersion
parameter (or a boolean to skip probing) and use that value when provided,
otherwise fall back to calling getDCGMVersion(); update callers like
CollectInput()/precheck.go to pass the detected DCGM version (from detectDCGM())
into GetMachineInfo() so MachineInfo.DCGMVersion and Input.DCGMVersion come from
the same single lookup and transient probe failures cannot produce divergent
values.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c731d81f-385b-4b2a-8ce4-d73ab091c003
📒 Files selected for processing (9)
internal/dcgmversion/dcgmversion.gointernal/dcgmversion/dcgmversion_test.gointernal/exporter/converter/csv.gointernal/exporter/converter/csv_test.gointernal/exporter/converter/otlp_test.gointernal/machineinfo/machineinfo.gointernal/machineinfo/machineinfo_test.gointernal/precheck/precheck.gointernal/precheck/precheck_test.go
Summary
Behavior Changes
dcgmVersionwhen DCGM HostEngine is reachableTesting
[#1490]
Summary by CodeRabbit