fix: detect gpu hardware before driver checks by jingxiang-z · Pull Request #158 · NVIDIA/fleet-intelligence-agent

jingxiang-z · 2026-04-07T20:17:06Z

Summary

- detect NVIDIA GPU hardware via PCI fallback when NVML cannot provide machine info
- report a driver-specific precheck failure when GPU hardware is present but the NVIDIA driver is missing
- add regression coverage for the hardware-present, driver-missing precheck path
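
The hardware-present, driver-missing path described above can be sketched roughly as follows. This is a minimal illustration, not the agent's code: the `Check` type, the check name, and both message strings are hypothetical, and `Input` is reduced to the two fields the sketch needs.

```go
package main

import "fmt"

// Input carries only the two fields this sketch needs; the names follow the
// PR walkthrough, but the real struct has more fields.
type Input struct {
	GPUDriverVersion   string
	GPUHardwarePresent bool
}

// Check is a simplified, hypothetical stand-in for the precheck result type.
type Check struct {
	Name    string
	Passed  bool
	Message string
}

// Illustrative messages, not the wording used by the agent.
const (
	msgNoHardware    = "no NVIDIA GPU hardware detected"
	msgDriverMissing = "NVIDIA GPU hardware detected, but the NVIDIA driver is not available; install the driver and retry"
)

// evaluateGPUPresence separates "no GPU at all" from "GPU present, driver missing".
func evaluateGPUPresence(in *Input) Check {
	c := Check{Name: "gpu-presence"}
	switch {
	case in == nil || !in.GPUHardwarePresent:
		c.Message = msgNoHardware
	case in.GPUDriverVersion == "":
		c.Message = msgDriverMissing
	default:
		c.Passed = true
		c.Message = "NVIDIA driver " + in.GPUDriverVersion
	}
	return c
}

func main() {
	cases := []*Input{
		{},
		{GPUHardwarePresent: true},
		{GPUHardwarePresent: true, GPUDriverVersion: "550.54.15"}, // sample version string
	}
	for _, in := range cases {
		c := evaluateGPUPresence(in)
		fmt.Printf("passed=%v message=%q\n", c.Passed, c.Message)
	}
}
```

Keeping the two failure cases apart is what lets the message point at the right remediation (install a driver) instead of reporting that no GPU exists.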

Testing

go test ./internal/precheck ./cmd/fleetint

Summary by CodeRabbit

Bug Fixes
- GPU pre-checks more reliably detect GPU hardware (including PCI fallback) and now skip or fail checks with clearer, actionable guidance for driver, architecture, and NVAttest/DCGM issues.
New Features
- Broader PCI GPU detection to include additional controller types; driver-version is collected and surfaced to checks and messages.
Tests
- Added/updated tests for driver-missing, GPU detail collection failures, architecture-skip behavior, and expanded PCI detection cases.

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

coderabbitai · 2026-04-07T20:17:19Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Refactors the GPU data flow. `Input` now carries explicit GPU fields; `CollectInput` populates `GPUDriverVersion`, `GPUInfo`, and `GPUHardwarePresent` (with a PCI fallback); `Evaluate` and the per-check functions read from the new `Input` fields and gate their behavior on them. Messages and tests are updated accordingly.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Precheck core `internal/precheck/precheck.go` | Replaced `machineinfo.MachineInfo` usage with explicit GPU fields on `Input`; CollectInput now queries NVML for driver version, calls `pkgmachineinfo.GetMachineGPUInfo`, and falls back to PCI enumeration with timeout to set `GPUHardwarePresent`. Evaluate and GPU/DCGM/NVAttest checks refactored to read from `Input` and use a shared `gpuHardwareDetected` gating helper; messages updated. |
| Precheck tests `internal/precheck/precheck_test.go` | Tests rewritten to construct `Input` with explicit GPU fields (`GPUInfo`, `GPUDriverVersion`, `GPUHardwarePresent`, `GPUInfoErr`); added tests for driver-unavailable and GPU-info-failure skip cases; assertions updated to new guidance/skip messages; mocked PCI detection signature updated in tests. |
| PCI detection `third_party/fleet-intelligence-sdk/pkg/nvidia/pci/detect_lspci.go`, `.../detect_lspci_test.go` | Expanded lspci GPU detection to treat both `3D controller` and `VGA compatible controller` as NVIDIA GPU devices; added test case validating VGA-compatible parsing. |
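
As a rough illustration of the broader PCI match described in the last row, the sketch below classifies `lspci` output lines by device class. The function names and the sample lines are made up for this example; the SDK's actual parser in `detect_lspci.go` may work differently.

```go
package main

import (
	"fmt"
	"strings"
)

// isNVIDIAGPULine reports whether one line of lspci output looks like an
// NVIDIA GPU. Hypothetical helper: it accepts both device classes named in
// the change summary and ignores other NVIDIA functions such as audio.
func isNVIDIAGPULine(line string) bool {
	lower := strings.ToLower(line)
	if !strings.Contains(lower, "nvidia") {
		return false
	}
	return strings.Contains(lower, "3d controller") ||
		strings.Contains(lower, "vga compatible controller")
}

// countNVIDIAGPUs counts matching lines in a full lspci dump.
func countNVIDIAGPUs(output string) int {
	n := 0
	for _, line := range strings.Split(output, "\n") {
		if isNVIDIAGPULine(line) {
			n++
		}
	}
	return n
}

func main() {
	// Illustrative sample input, not captured from a real machine.
	sample := strings.Join([]string{
		"00:1f.0 ISA bridge: Intel Corporation Device 7a06",
		"01:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)",
		"01:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)",
		"41:00.0 3D controller: NVIDIA Corporation Device 2330 (rev a1)",
	}, "\n")
	fmt.Println(countNVIDIAGPUs(sample)) // prints 2
}
```

Matching on the class name in addition to the vendor keeps the HDMI audio function that ships alongside many GPUs from being counted as a second device.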

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Collector as "CollectInput"
participant NVML as "NVML"
participant MachineInfo as "pkgmachineinfo\n(GetMachineGPUInfo)"
participant PCI as "PCI enumerator\n(listPCIGPUs / lspci)"
participant Evaluator as "Evaluate"
participant DCGM as "DCGM / NVAttest"

Collector->>NVML: request driver version
NVML-->>Collector: GPUDriverVersion (or empty)
Collector->>MachineInfo: request GPUInfo
MachineInfo-->>Collector: GPUInfo or error (GPUInfoErr)
alt No GPUInfo & NVML empty
    Collector->>PCI: enumerate PCI GPUs (5s timeout)
    PCI-->>Collector: GPUs found? -> sets GPUHardwarePresent
end
Collector->>Evaluator: pass *Input (GPUDriverVersion, GPUInfo, GPUHardwarePresent, ...)
Evaluator->>Evaluator: gpuHardwareDetected(input) -> gate checks
Evaluator->>DCGM: run DCGM / NVAttest checks (skipped / run based on input)
DCGM-->>Evaluator: results
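
The collection half of the diagram can be sketched in Go as below. This is a simplified model under stated assumptions: the `probes` struct and `stubProbes` helper are hypothetical test seams, `GPUInfo` is reduced to a slice of strings, and the fallback condition follows the diagram's `alt` branch rather than the actual source.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Input uses the field names from the walkthrough; GPUInfo is simplified to a
// slice of names here, which is an assumption about its shape.
type Input struct {
	GPUDriverVersion   string
	GPUInfo            []string
	GPUInfoErr         error
	GPUHardwarePresent bool
}

// probes bundles the three data sources so they can be stubbed (hypothetical).
type probes struct {
	driverVersion func() string
	gpuInfo       func() ([]string, error)
	listPCIGPUs   func(ctx context.Context) ([]string, error)
}

// stubProbes builds canned probes; an empty info slice simulates an NVML failure.
func stubProbes(driver string, info, pciDevs []string) probes {
	return probes{
		driverVersion: func() string { return driver },
		gpuInfo: func() ([]string, error) {
			if len(info) == 0 {
				return nil, errors.New("NVML unavailable")
			}
			return info, nil
		},
		listPCIGPUs: func(context.Context) ([]string, error) { return pciDevs, nil },
	}
}

func collectInput(p probes) *Input {
	in := &Input{GPUDriverVersion: p.driverVersion()}
	in.GPUInfo, in.GPUInfoErr = p.gpuInfo()

	// Fall back to PCI enumeration only when NVML produced nothing at all.
	if len(in.GPUInfo) == 0 && in.GPUDriverVersion == "" {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		devs, err := p.listPCIGPUs(ctx)
		in.GPUHardwarePresent = err == nil && len(devs) > 0
	}
	return in
}

// gpuHardwareDetected is the shared gate consulted by the individual checks.
func gpuHardwareDetected(in *Input) bool {
	return in != nil && (in.GPUHardwarePresent || len(in.GPUInfo) > 0)
}

func main() {
	// NVML fails, but the PCI fallback finds one device.
	in := collectInput(stubProbes("", nil, []string{"01:00.0"}))
	fmt.Println(gpuHardwareDetected(in), in.GPUInfoErr)
}
```

Recording `GPUInfoErr` separately from `GPUHardwarePresent` lets later checks skip with a specific reason instead of collapsing every failure into "no GPU".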

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

"I hopped through logs and lines so neat,
Gathered drivers, PCI, and heat.
Collected bits of GPU lore,
Skipped a check and checked one more.
A carrot-coded cheer — prechecks complete!" 🐇

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.54% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main change: refactoring GPU hardware detection to occur before driver checks, which is the primary objective of the PR. |

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1db0da939f

coderabbitai

🧹 Nitpick comments (1)

internal/precheck/precheck.go (1)

256-266: Consider logging PCI detection errors for debugging.

The silent return false on error is reasonable for a fallback path, but logging the error at debug level would help troubleshoot cases where hardware detection unexpectedly fails (e.g., lspci not installed, permission issues).

♻️ Optional: Add debug logging

 func detectGPUHardware() bool {
 	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
 	defer cancel()

 	devs, err := listPCIGPUs(ctx)
 	if err != nil {
+		// Log at debug level since this is a fallback detection path
+		// log.Debug("PCI GPU detection failed", "error", err)
 		return false
 	}

 	return len(devs) > 0
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/precheck/precheck.go` around lines 256 - 266, In detectGPUHardware,
when listPCIGPUs(ctx) returns an error you should log that error at debug level
before returning false; update the error branch in detectGPUHardware to emit a
debug log containing the error (and brief context such as that PCI GPU detection
failed) using your project's debug logger, then return false—this keeps the
fallback behavior but surfaces useful info for troubleshooting when listPCIGPUs
fails.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8accfe60-068c-4432-8a04-4a52dd6ff39c

📥 Commits

Reviewing files that changed from the base of the PR and between 810acfb and 1db0da9.

📒 Files selected for processing (2)

internal/precheck/precheck.go
internal/precheck/precheck_test.go

rsampaio

Left a non-blocking comment. The GPU detection itself looks good; it would be nice to log exactly what the issue is in the check and point to the right remediation.

coderabbitai

🧹 Nitpick comments (4)

internal/precheck/precheck.go (4)

262-272: Consider logging PCI detection errors for debuggability.

When listPCIGPUs fails (e.g., lspci not installed, permission denied), the error is silently discarded. While returning false is correct fallback behavior, logging the error at debug level would help operators troubleshoot why GPU hardware wasn't detected.

Proposed enhancement

 func detectGPUHardware() bool {
 	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
 	defer cancel()

 	devs, err := listPCIGPUs(ctx)
 	if err != nil {
+		// Log at debug level to aid troubleshooting without alarming operators
+		log.Logger.Debugw("PCI GPU detection failed", "error", err)
 		return false
 	}

 	return len(devs) > 0
 }

This requires importing the log package.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/precheck/precheck.go` around lines 262 - 272, The detectGPUHardware
function currently swallows errors from listPCIGPUs; update it to log the
returned error at debug level before returning false so operators can diagnose
failures. Specifically, in detectGPUHardware (which calls listPCIGPUs) add a
debug log call that includes the error value and context (e.g., "failed to list
PCI GPUs") when err != nil, and ensure the package imports and uses the
project's logging package (add the appropriate log import) rather than
discarding the error.

164-164: Pass Input by pointer to avoid copying 80 bytes.

Same as evaluateGPUPresence, this function only reads from input and should accept a pointer for efficiency.

Proposed fix

-func evaluateArchitecture(input Input) Check {
+func evaluateArchitecture(input *Input) Check {

Update call sites and helper functions accordingly.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/precheck/precheck.go` at line 164, The function evaluateArchitecture
currently takes an Input by value which copies ~80 bytes; change its signature
to accept *Input (pointer) like evaluateGPUPresence to avoid unnecessary
copying, update all call sites that invoke evaluateArchitecture to pass the
address (&inputVar) and adjust any helper functions or tests that referenced the
old signature to accept *Input as well; ensure no code mutates the input
unexpectedly (preserve read-only behavior) and run build/tests to catch places
missed.

250-256: Consider accepting *Input for consistency if refactoring other functions.

If you refactor evaluateGPUPresence and evaluateArchitecture to accept *Input, update gpuHardwareDetected accordingly for consistency:

-func gpuHardwareDetected(input Input) bool {
+func gpuHardwareDetected(input *Input) bool {

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/precheck/precheck.go` around lines 250 - 256, Update
gpuHardwareDetected to accept a pointer to Input to match the refactor of
evaluateGPUPresence and evaluateArchitecture: change the signature from func
gpuHardwareDetected(input Input) bool to func gpuHardwareDetected(input *Input)
bool, update its body to dereference or use the pointer fields
(input.GPUHardwarePresent and input.GPUInfo) accordingly, and update all call
sites (e.g., where evaluateGPUPresence/evaluateArchitecture now pass *Input) to
pass the same *Input to gpuHardwareDetected so parameter types remain consistent
across these helper functions.

149-162: Pass Input by pointer to avoid copying 80 bytes.

The static analysis correctly identifies that Input (80 bytes) is passed by value. Since this function only reads from input, passing by pointer avoids unnecessary copies and improves efficiency.

Proposed fix

-func evaluateGPUPresence(input Input) Check {
+func evaluateGPUPresence(input *Input) Check {
 	if !gpuHardwareDetected(input) {

This also requires updating gpuHardwareDetected to accept *Input and updating the call site in Evaluate.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/precheck/precheck.go` around lines 149 - 162, Change
evaluateGPUPresence to accept *Input instead of Input and update
gpuHardwareDetected to accept *Input as well to avoid copying the ~80-byte
struct; inside evaluateGPUPresence use the pointer parameter for the check and
return the same Check values. Also update the call site in Evaluate (where
evaluateGPUPresence is invoked) to pass the address of the Input (e.g., &input)
so the new pointer signatures are used consistently.
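
To make the by-value versus by-pointer trade-off in these nitpicks concrete, here is a small sketch. The `Input` struct below is a stand-in with a similar shape, not the real definition, so the printed sizes are only indicative and depend on the platform.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Input is a stand-in with a similar shape; the real struct's fields may differ.
type Input struct {
	GPUDriverVersion   string
	GPUInfo            []string
	GPUInfoErr         error
	GPUHardwarePresent bool
	Architecture       string // hypothetical extra field
}

// byValue receives a full copy of Input on every call.
func byValue(in Input) bool { return in.GPUHardwarePresent }

// byPointer receives only an address, so it has to tolerate nil.
func byPointer(in *Input) bool { return in != nil && in.GPUHardwarePresent }

// argSizes returns the size of each argument type: the struct and a pointer to it.
func argSizes() (value, pointer uintptr) {
	var in Input
	return unsafe.Sizeof(in), unsafe.Sizeof(&in)
}

func main() {
	in := Input{GPUHardwarePresent: true}
	v, p := argSizes()
	fmt.Printf("by value: %d bytes, by pointer: %d bytes\n", v, p)
	fmt.Println(byValue(in), byPointer(&in), byPointer(nil))
}
```

The pointer form saves the copy but introduces a nil case that every reader of `*Input` must handle, which is relevant to the nil-dereference finding raised in the later review.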

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ef309110-5f36-415f-a1b5-130412c2ee0a

📥 Commits

Reviewing files that changed from the base of the PR and between 5e3607b and 0008696.

📒 Files selected for processing (2)

internal/precheck/precheck.go
internal/precheck/precheck_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

internal/precheck/precheck_test.go

jingxiang-z · 2026-04-07T22:39:17Z

@coderabbitai review

coderabbitai · 2026-04-07T22:39:22Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/precheck/precheck.go`:
- Around line 266-275: The current detectGPUHardware hides probe failures by
returning false on any listPCIGPUs error; change detectGPUHardware to return
(bool, error) and propagate the underlying error from listPCIGPUs instead of
swallowing it (i.e., call devs, err := listPCIGPUs(ctx); if err != nil { return
false, err } ; return len(devs) > 0, nil). Then update callers (the Input
construction and evaluateGPUPresence flow) to accept the (bool, error) result
and treat an error as "unable to determine GPU presence" (surface the error
upstream) while only treating a nil error with empty slice as the real "no GPU"
case. Ensure references to detectGPUHardware, listPCIGPUs, and
evaluateGPUPresence are updated accordingly.
- Around line 137-143: Evaluate currently dereferences input.NVAttestPresent
while building the checks slice which panics when input is nil; fix by computing
a safe nvAttestPresent variable before constructing checks (e.g.,
nvAttestPresent := false; if input != nil { nvAttestPresent =
input.NVAttestPresent }) and then call evaluateNVAttest(nvAttestPresent) inside
Evaluate so the nil guard in per-check functions can run without the pre-check
panic.
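
The two inline findings above can be sketched together as follows. This is an illustration under assumptions, not the merged fix: `listPCIGPUs` is modeled as a swappable package variable, `setPCIStub` is a hypothetical test helper, and `Evaluate` returns plain strings instead of the real check type.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// listPCIGPUs is a package-level variable so it can be stubbed (assumption).
var listPCIGPUs = func(ctx context.Context) ([]string, error) { return nil, nil }

// setPCIStub replaces the probe; a non-empty failMsg makes it return an error.
func setPCIStub(devs []string, failMsg string) {
	listPCIGPUs = func(context.Context) ([]string, error) {
		if failMsg != "" {
			return nil, errors.New(failMsg)
		}
		return devs, nil
	}
}

// detectGPUHardware propagates probe errors so callers can tell
// "could not determine" apart from "no GPU found".
func detectGPUHardware() (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	devs, err := listPCIGPUs(ctx)
	if err != nil {
		return false, fmt.Errorf("pci gpu probe: %w", err)
	}
	return len(devs) > 0, nil
}

// Input is reduced to the single field the nil-guard example needs.
type Input struct{ NVAttestPresent bool }

func evaluateNVAttest(present bool) string {
	if present {
		return "nvattest: ok"
	}
	return "nvattest: not found"
}

// Evaluate reads NVAttestPresent through a nil guard before building checks.
func Evaluate(in *Input) []string {
	nvAttestPresent := false
	if in != nil {
		nvAttestPresent = in.NVAttestPresent
	}
	return []string{evaluateNVAttest(nvAttestPresent)}
}

func main() {
	setPCIStub(nil, "lspci: executable not found")
	present, err := detectGPUHardware()
	fmt.Println(present, err)

	fmt.Println(Evaluate(nil)) // no panic on nil input
}
```

With the `(bool, error)` signature, only a nil error combined with an empty device list counts as a genuine "no GPU" result.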

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 27d3ef74-0cc3-4466-aa83-1aac1e69227c

📥 Commits

Reviewing files that changed from the base of the PR and between 0008696 and 693a33b.

📒 Files selected for processing (2)

internal/precheck/precheck.go
internal/precheck/precheck_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

internal/precheck/precheck_test.go

fix: detect gpu hardware before driver checks

1db0da9

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

chatgpt-codex-connector Bot reviewed Apr 7, 2026

Comment thread internal/precheck/precheck.go Outdated

coderabbitai Bot reviewed Apr 7, 2026

jingxiang-z requested a review from rsampaio April 7, 2026 20:45

jingxiang-z self-assigned this Apr 7, 2026

fix: detect vga-compatible nvidia gpus in pci fallback

1a3424f

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

rsampaio approved these changes Apr 7, 2026

jingxiang-z added 2 commits April 7, 2026 15:21

fix: improve precheck remediation messages

5e3607b

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

refactor: use gpu-specific precheck collection

0008696

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

coderabbitai Bot reviewed Apr 7, 2026

jingxiang-z added 2 commits April 7, 2026 15:31

fix: avoid copying precheck input

b07d3c1

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

fix: avoid copying dcgm precheck input

693a33b

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

coderabbitai Bot reviewed Apr 7, 2026

Comment thread internal/precheck/precheck.go

Comment thread internal/precheck/precheck.go Outdated

fix: handle precheck input and pci probe errors

4a50559

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

jingxiang-z merged commit 00311d7 into main Apr 7, 2026
9 checks passed

jingxiang-z deleted the fix/precheck-driver-detection branch April 7, 2026 23:28
