fix: detect gpu hardware before driver checks#158

Merged
jingxiang-z merged 7 commits into main from fix/precheck-driver-detection
Apr 7, 2026
@jingxiang-z
Collaborator

@jingxiang-z jingxiang-z commented Apr 7, 2026

Summary

  • detect NVIDIA GPU hardware via PCI fallback when NVML cannot provide machine info
  • report a driver-specific precheck failure when GPU hardware is present but the NVIDIA driver is missing
  • add regression coverage for the hardware-present, driver-missing precheck path
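
The hardware-present, driver-missing path from the summary could be sketched roughly as follows. This is an illustration only, not the code merged in this PR: the `Check` shape, the status strings, the check name, and the messages are assumptions, and only the field names `GPUHardwarePresent` and `GPUDriverVersion` come from the review walkthrough.

```go
package main

import "fmt"

// Input is a hypothetical, trimmed-down version of the precheck input.
type Input struct {
	GPUHardwarePresent bool   // set when GPU hardware was found (for example via PCI)
	GPUDriverVersion   string // empty when no driver version could be collected
}

// Check is a hypothetical result type for a single precheck.
type Check struct {
	Name    string
	Status  string // "pass", "fail", or "skip" (assumed values)
	Message string
}

// evaluateDriver reports a driver-specific failure only when GPU hardware
// was detected but no driver version is available.
func evaluateDriver(in *Input) Check {
	const name = "gpu-driver" // hypothetical check name
	switch {
	case in == nil || !in.GPUHardwarePresent:
		return Check{Name: name, Status: "skip", Message: "no NVIDIA GPU hardware detected"}
	case in.GPUDriverVersion == "":
		return Check{Name: name, Status: "fail", Message: "NVIDIA GPU hardware found but no NVIDIA driver detected; install or load the driver"}
	default:
		return Check{Name: name, Status: "pass", Message: "driver version " + in.GPUDriverVersion}
	}
}

func main() {
	c := evaluateDriver(&Input{GPUHardwarePresent: true})
	fmt.Println(c.Status, "-", c.Message)
}
```

Keeping "no hardware" (skip) separate from "hardware without a driver" (fail) is what lets the failure message point at the driver specifically.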

Testing

  • go test ./internal/precheck ./cmd/fleetint

Related

Summary by CodeRabbit

  • Bug Fixes
    • GPU pre-checks more reliably detect GPU hardware (including PCI fallback) and now skip or fail checks with clearer, actionable guidance for driver, architecture, and NVAttest/DCGM issues.
  • New Features
    • Broader PCI GPU detection to include additional controller types; driver-version is collected and surfaced to checks and messages.
  • Tests
    • Added/updated tests for driver-missing, GPU detail collection failures, architecture-skip behavior, and expanded PCI detection cases.

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
@coderabbitai

coderabbitai Bot commented Apr 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Refactors the GPU data flow: Input now carries explicit GPU fields; CollectInput populates GPUDriverVersion, GPUInfo, and GPUHardwarePresent (with a PCI fallback); Evaluate and the per-check functions now accept the new Input fields and gate their behavior on them; messages and tests are updated accordingly.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Precheck core: `internal/precheck/precheck.go` | Replaced *machineinfo.MachineInfo usage with explicit GPU fields on Input; CollectInput now queries NVML for driver version, calls pkgmachineinfo.GetMachineGPUInfo, and falls back to PCI enumeration with timeout to set GPUHardwarePresent. Evaluate and GPU/DCGM/NVAttest checks refactored to read from *Input and use a shared gpuHardwareDetected gating helper; messages updated. |
| Precheck tests: `internal/precheck/precheck_test.go` | Tests rewritten to construct Input with explicit GPU fields (GPUInfo, GPUDriverVersion, GPUHardwarePresent, GPUInfoErr); added tests for driver-unavailable and GPU-info-failure skip cases; assertions updated to new guidance/skip messages; mocked PCI detection signature updated in tests. |
| PCI detection: `third_party/fleet-intelligence-sdk/pkg/nvidia/pci/detect_lspci.go`, `.../detect_lspci_test.go` | Expanded lspci GPU detection to treat both 3D controller and VGA compatible controller as NVIDIA GPU devices; added test case validating VGA-compatible parsing. |
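
As a rough sketch of the broader PCI matching described in the last row, the snippet below filters lspci-style output for NVIDIA devices in either class. It is not the SDK's parser: the function names, the plain-text `lspci` line format, and the sample device lines are assumptions made for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// gpuClasses lists the lspci device classes treated as GPUs in this sketch.
var gpuClasses = []string{"3D controller", "VGA compatible controller"}

// isNVIDIAGPULine reports whether one line of lspci output describes an
// NVIDIA device in one of the GPU classes above.
func isNVIDIAGPULine(line string) bool {
	if !strings.Contains(strings.ToLower(line), "nvidia") {
		return false
	}
	for _, class := range gpuClasses {
		if strings.Contains(line, class) {
			return true
		}
	}
	return false
}

// parseLspci returns the lines of lspci output that look like NVIDIA GPUs.
func parseLspci(output string) []string {
	var gpus []string
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		if line := strings.TrimSpace(sc.Text()); isNVIDIAGPULine(line) {
			gpus = append(gpus, line)
		}
	}
	return gpus
}

func main() {
	// Made-up sample output; real device names will differ.
	sample := `00:02.0 VGA compatible controller: Intel Corporation Device 4680
01:00.0 VGA compatible controller: NVIDIA Corporation AD102 [GeForce RTX 4090]
02:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB]
01:00.1 Audio device: NVIDIA Corporation AD102 High Definition Audio Controller`
	for _, gpu := range parseLspci(sample) {
		fmt.Println(gpu)
	}
}
```

Matching on both the vendor and the device class keeps NVIDIA audio functions and non-NVIDIA VGA controllers out of the result.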

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Collector as "CollectInput"
participant NVML as "NVML"
participant MachineInfo as "pkgmachineinfo\n(GetMachineGPUInfo)"
participant PCI as "PCI enumerator\n(listPCIGPUs / lspci)"
participant Evaluator as "Evaluate"
participant DCGM as "DCGM / NVAttest"

Collector->>NVML: request driver version
NVML-->>Collector: GPUDriverVersion (or empty)
Collector->>MachineInfo: request GPUInfo
MachineInfo-->>Collector: GPUInfo or error (GPUInfoErr)
alt No GPUInfo & NVML empty
    Collector->>PCI: enumerate PCI GPUs (5s timeout)
    PCI-->>Collector: GPUs found? -> sets GPUHardwarePresent
end
Collector->>Evaluator: pass *Input (GPUDriverVersion, GPUInfo, GPUHardwarePresent, ...)
Evaluator->>Evaluator: gpuHardwareDetected(input) -> gate checks
Evaluator->>DCGM: run DCGM / NVAttest checks (skipped / run based on input)
DCGM-->>Evaluator: results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

"I hopped through logs and lines so neat,
Gathered drivers, PCI, and heat.
Collected bits of GPU lore,
Skipped a check and checked one more.
A carrot-coded cheer — prechecks complete!" 🐇

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.54% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main change: refactoring GPU hardware detection to occur before driver checks, which is the primary objective of the PR. |



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1db0da939f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/precheck/precheck.go Outdated

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
internal/precheck/precheck.go (1)

256-266: Consider logging PCI detection errors for debugging.

The silent return false on error is reasonable for a fallback path, but logging the error at debug level would help troubleshoot cases where hardware detection unexpectedly fails (e.g., lspci not installed, permission issues).

♻️ Optional: Add debug logging
 func detectGPUHardware() bool {
 	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
 	defer cancel()

 	devs, err := listPCIGPUs(ctx)
 	if err != nil {
+		// Log at debug level since this is a fallback detection path
+		// log.Debug("PCI GPU detection failed", "error", err)
 		return false
 	}

 	return len(devs) > 0
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/precheck/precheck.go` around lines 256 - 266, In detectGPUHardware,
when listPCIGPUs(ctx) returns an error you should log that error at debug level
before returning false; update the error branch in detectGPUHardware to emit a
debug log containing the error (and brief context such as that PCI GPU detection
failed) using your project's debug logger, then return false—this keeps the
fallback behavior but surfaces useful info for troubleshooting when listPCIGPUs
fails.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8accfe60-068c-4432-8a04-4a52dd6ff39c

📥 Commits

Reviewing files that changed from the base of the PR and between 810acfb and 1db0da9.

📒 Files selected for processing (2)
  • internal/precheck/precheck.go
  • internal/precheck/precheck_test.go

@jingxiang-z jingxiang-z requested a review from rsampaio April 7, 2026 20:45
@jingxiang-z jingxiang-z self-assigned this Apr 7, 2026
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
Collaborator

@rsampaio rsampaio left a comment


Left a non-blocking comment. The GPU detection itself looks good; it would be nice to log exactly what the issue is in the check and point to the right remediation.

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (4)
internal/precheck/precheck.go (4)

262-272: Consider logging PCI detection errors for debuggability.

When listPCIGPUs fails (e.g., lspci not installed, permission denied), the error is silently discarded. While returning false is correct fallback behavior, logging the error at debug level would help operators troubleshoot why GPU hardware wasn't detected.

Proposed enhancement
 func detectGPUHardware() bool {
 	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
 	defer cancel()

 	devs, err := listPCIGPUs(ctx)
 	if err != nil {
+		// Log at debug level to aid troubleshooting without alarming operators
+		log.Logger.Debugw("PCI GPU detection failed", "error", err)
 		return false
 	}

 	return len(devs) > 0
 }

This requires importing the log package.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/precheck/precheck.go` around lines 262 - 272, The detectGPUHardware
function currently swallows errors from listPCIGPUs; update it to log the
returned error at debug level before returning false so operators can diagnose
failures. Specifically, in detectGPUHardware (which calls listPCIGPUs) add a
debug log call that includes the error value and context (e.g., "failed to list
PCI GPUs") when err != nil, and ensure the package imports and uses the
project's logging package (add the appropriate log import) rather than
discarding the error.

164-164: Pass Input by pointer to avoid copying 80 bytes.

Same as evaluateGPUPresence, this function only reads from input and should accept a pointer for efficiency.

Proposed fix
-func evaluateArchitecture(input Input) Check {
+func evaluateArchitecture(input *Input) Check {

Update call sites and helper functions accordingly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/precheck/precheck.go` at line 164, The function evaluateArchitecture
currently takes an Input by value which copies ~80 bytes; change its signature
to accept *Input (pointer) like evaluateGPUPresence to avoid unnecessary
copying, update all call sites that invoke evaluateArchitecture to pass the
address (&inputVar) and adjust any helper functions or tests that referenced the
old signature to accept *Input as well; ensure no code mutates the input
unexpectedly (preserve read-only behavior) and run build/tests to catch places
missed.

250-256: Consider accepting *Input for consistency if refactoring other functions.

If you refactor evaluateGPUPresence and evaluateArchitecture to accept *Input, update gpuHardwareDetected accordingly for consistency:

-func gpuHardwareDetected(input Input) bool {
+func gpuHardwareDetected(input *Input) bool {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/precheck/precheck.go` around lines 250 - 256, Update
gpuHardwareDetected to accept a pointer to Input to match the refactor of
evaluateGPUPresence and evaluateArchitecture: change the signature from func
gpuHardwareDetected(input Input) bool to func gpuHardwareDetected(input *Input)
bool, update its body to dereference or use the pointer fields
(input.GPUHardwarePresent and input.GPUInfo) accordingly, and update all call
sites (e.g., where evaluateGPUPresence/evaluateArchitecture now pass *Input) to
pass the same *Input to gpuHardwareDetected so parameter types remain consistent
across these helper functions.

149-162: Pass Input by pointer to avoid copying 80 bytes.

The static analysis correctly identifies that Input (80 bytes) is passed by value. Since this function only reads from input, passing by pointer avoids unnecessary copies and improves efficiency.

Proposed fix
-func evaluateGPUPresence(input Input) Check {
-	if !gpuHardwareDetected(input) {
+func evaluateGPUPresence(input *Input) Check {
+	if !gpuHardwareDetected(input) {

This also requires updating gpuHardwareDetected to accept *Input and updating the call site in Evaluate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/precheck/precheck.go` around lines 149 - 162, Change
evaluateGPUPresence to accept *Input instead of Input and update
gpuHardwareDetected to accept *Input as well to avoid copying the ~80-byte
struct; inside evaluateGPUPresence use the pointer parameter for the check and
return the same Check values. Also update the call site in Evaluate (where
evaluateGPUPresence is invoked) to pass the address of the Input (e.g., &input)
so the new pointer signatures are used consistently.
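
Taken together, the three pointer-related nitpicks amount to something like the sketch below. The `Input` layout, the `Check` type, and the status strings are assumptions, so the size printed by `unsafe.Sizeof` will not necessarily match the 80 bytes reported for the real struct.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Input is a hypothetical layout using field names from the review.
type Input struct {
	GPUDriverVersion   string
	GPUInfo            []string
	GPUInfoErr         error
	GPUHardwarePresent bool
	NVAttestPresent    bool
}

// Check is a hypothetical result type.
type Check struct {
	Name   string
	Status string
}

// gpuHardwareDetected takes *Input so callers share one copy of the struct.
func gpuHardwareDetected(in *Input) bool {
	return in != nil && (in.GPUHardwarePresent || len(in.GPUInfo) > 0)
}

// evaluateGPUPresence only reads from in, so a pointer avoids the copy.
func evaluateGPUPresence(in *Input) Check {
	if !gpuHardwareDetected(in) {
		return Check{Name: "gpu-presence", Status: "skip"}
	}
	return Check{Name: "gpu-presence", Status: "pass"}
}

// Evaluate passes the same pointer down to every helper.
func Evaluate(in *Input) []Check {
	return []Check{evaluateGPUPresence(in)}
}

func main() {
	in := Input{GPUHardwarePresent: true}
	fmt.Println("Input size in bytes:", unsafe.Sizeof(in))
	fmt.Println(Evaluate(&in))
}
```

For a struct this small the saving is minor; the main benefit is a consistent signature across the helpers, along with a single place to guard against a nil input.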

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ef309110-5f36-415f-a1b5-130412c2ee0a

📥 Commits

Reviewing files that changed from the base of the PR and between 5e3607b and 0008696.

📒 Files selected for processing (2)
  • internal/precheck/precheck.go
  • internal/precheck/precheck_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/precheck/precheck_test.go

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
@jingxiang-z
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 7, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/precheck/precheck.go`:
- Around line 266-275: The current detectGPUHardware hides probe failures by
returning false on any listPCIGPUs error; change detectGPUHardware to return
(bool, error) and propagate the underlying error from listPCIGPUs instead of
swallowing it (i.e., call devs, err := listPCIGPUs(ctx); if err != nil { return
false, err } ; return len(devs) > 0, nil). Then update callers (the Input
construction and evaluateGPUPresence flow) to accept the (bool, error) result
and treat an error as "unable to determine GPU presence" (surface the error
upstream) while only treating a nil error with empty slice as the real "no GPU"
case. Ensure references to detectGPUHardware, listPCIGPUs, and
evaluateGPUPresence are updated accordingly.
- Around line 137-143: Evaluate currently dereferences input.NVAttestPresent
while building the checks slice which panics when input is nil; fix by computing
a safe nvAttestPresent variable before constructing checks (e.g.,
nvAttestPresent := false; if input != nil { nvAttestPresent =
input.NVAttestPresent }) and then call evaluateNVAttest(nvAttestPresent) inside
Evaluate so the nil guard in per-check functions can run without the pre-check
panic.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 27d3ef74-0cc3-4466-aa83-1aac1e69227c

📥 Commits

Reviewing files that changed from the base of the PR and between 0008696 and 693a33b.

📒 Files selected for processing (2)
  • internal/precheck/precheck.go
  • internal/precheck/precheck_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/precheck/precheck_test.go

Comment thread internal/precheck/precheck.go
Comment thread internal/precheck/precheck.go Outdated
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
@jingxiang-z jingxiang-z merged commit 00311d7 into main Apr 7, 2026
9 checks passed
@jingxiang-z jingxiang-z deleted the fix/precheck-driver-detection branch April 7, 2026 23:28
2 participants