fix: detect gpu hardware before driver checks#158
Conversation
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughRefactors GPU data flow: Input now carries explicit GPU fields; CollectInput populates GPUDriverVersion, GPUInfo, and GPUHardwarePresent (with PCI fallback); Evaluate and per-check functions now accept and gate behavior on the new Input fields, and messages/tests updated accordingly. Changes
Sequence Diagram(s)mermaid Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1db0da939f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/precheck/precheck.go (1)
256-266: Consider logging PCI detection errors for debugging.The silent
return falseon error is reasonable for a fallback path, but logging the error at debug level would help troubleshoot cases where hardware detection unexpectedly fails (e.g.,lspcinot installed, permission issues).♻️ Optional: Add debug logging
func detectGPUHardware() bool { ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) defer cancel() devs, err := listPCIGPUs(ctx) if err != nil { + // Log at debug level since this is a fallback detection path + // log.Debug("PCI GPU detection failed", "error", err) return false } return len(devs) > 0 }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/precheck/precheck.go` around lines 256 - 266, In detectGPUHardware, when listPCIGPUs(ctx) returns an error you should log that error at debug level before returning false; update the error branch in detectGPUHardware to emit a debug log containing the error (and brief context such as that PCI GPU detection failed) using your project's debug logger, then return false—this keeps the fallback behavior but surfaces useful info for troubleshooting when listPCIGPUs fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/precheck/precheck.go`:
- Around line 256-266: In detectGPUHardware, when listPCIGPUs(ctx) returns an
error you should log that error at debug level before returning false; update
the error branch in detectGPUHardware to emit a debug log containing the error
(and brief context such as that PCI GPU detection failed) using your project's
debug logger, then return false—this keeps the fallback behavior but surfaces
useful info for troubleshooting when listPCIGPUs fails.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8accfe60-068c-4432-8a04-4a52dd6ff39c
📒 Files selected for processing (2)
internal/precheck/precheck.gointernal/precheck/precheck_test.go
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
rsampaio
left a comment
There was a problem hiding this comment.
Left a non-block comment, the GPU detection itself looks good and would be nice to log exactly what the issue is in the check and point to the right remediation
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (4)
internal/precheck/precheck.go (4)
262-272: Consider logging PCI detection errors for debuggability.When
listPCIGPUsfails (e.g.,lspcinot installed, permission denied), the error is silently discarded. While returningfalseis correct fallback behavior, logging the error at debug level would help operators troubleshoot why GPU hardware wasn't detected.Proposed enhancement
func detectGPUHardware() bool { ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) defer cancel() devs, err := listPCIGPUs(ctx) if err != nil { + // Log at debug level to aid troubleshooting without alarming operators + log.Logger.Debugw("PCI GPU detection failed", "error", err) return false } return len(devs) > 0 }This requires importing the log package.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/precheck/precheck.go` around lines 262 - 272, The detectGPUHardware function currently swallows errors from listPCIGPUs; update it to log the returned error at debug level before returning false so operators can diagnose failures. Specifically, in detectGPUHardware (which calls listPCIGPUs) add a debug log call that includes the error value and context (e.g., "failed to list PCI GPUs") when err != nil, and ensure the package imports and uses the project's logging package (add the appropriate log import) rather than discarding the error.
164-164: PassInputby pointer to avoid copying 80 bytes.Same as
evaluateGPUPresence, this function only reads frominputand should accept a pointer for efficiency.Proposed fix
-func evaluateArchitecture(input Input) Check { +func evaluateArchitecture(input *Input) Check {Update call sites and helper functions accordingly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/precheck/precheck.go` at line 164, The function evaluateArchitecture currently takes an Input by value which copies ~80 bytes; change its signature to accept *Input (pointer) like evaluateGPUPresence to avoid unnecessary copying, update all call sites that invoke evaluateArchitecture to pass the address (&inputVar) and adjust any helper functions or tests that referenced the old signature to accept *Input as well; ensure no code mutates the input unexpectedly (preserve read-only behavior) and run build/tests to catch places missed.
250-256: Consider accepting*Inputfor consistency if refactoring other functions.If you refactor
evaluateGPUPresenceandevaluateArchitectureto accept*Input, updategpuHardwareDetectedaccordingly for consistency:-func gpuHardwareDetected(input Input) bool { +func gpuHardwareDetected(input *Input) bool {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/precheck/precheck.go` around lines 250 - 256, Update gpuHardwareDetected to accept a pointer to Input to match the refactor of evaluateGPUPresence and evaluateArchitecture: change the signature from func gpuHardwareDetected(input Input) bool to func gpuHardwareDetected(input *Input) bool, update its body to dereference or use the pointer fields (input.GPUHardwarePresent and input.GPUInfo) accordingly, and update all call sites (e.g., where evaluateGPUPresence/evaluateArchitecture now pass *Input) to pass the same *Input to gpuHardwareDetected so parameter types remain consistent across these helper functions.
149-162: PassInputby pointer to avoid copying 80 bytes.The static analysis correctly identifies that
Input(80 bytes) is passed by value. Since this function only reads frominput, passing by pointer avoids unnecessary copies and improves efficiency.Proposed fix
-func evaluateGPUPresence(input Input) Check { - if !gpuHardwareDetected(input) { +func evaluateGPUPresence(input *Input) Check { + if !gpuHardwareDetected(input) {This also requires updating
gpuHardwareDetectedto accept*Inputand updating the call site inEvaluate.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/precheck/precheck.go` around lines 149 - 162, Change evaluateGPUPresence to accept *Input instead of Input and update gpuHardwareDetected to accept *Input as well to avoid copying the ~80-byte struct; inside evaluateGPUPresence use the pointer parameter for the check and return the same Check values. Also update the call site in Evaluate (where evaluateGPUPresence is invoked) to pass the address of the Input (e.g., &input) so the new pointer signatures are used consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/precheck/precheck.go`:
- Around line 262-272: The detectGPUHardware function currently swallows errors
from listPCIGPUs; update it to log the returned error at debug level before
returning false so operators can diagnose failures. Specifically, in
detectGPUHardware (which calls listPCIGPUs) add a debug log call that includes
the error value and context (e.g., "failed to list PCI GPUs") when err != nil,
and ensure the package imports and uses the project's logging package (add the
appropriate log import) rather than discarding the error.
- Line 164: The function evaluateArchitecture currently takes an Input by value
which copies ~80 bytes; change its signature to accept *Input (pointer) like
evaluateGPUPresence to avoid unnecessary copying, update all call sites that
invoke evaluateArchitecture to pass the address (&inputVar) and adjust any
helper functions or tests that referenced the old signature to accept *Input as
well; ensure no code mutates the input unexpectedly (preserve read-only
behavior) and run build/tests to catch places missed.
- Around line 250-256: Update gpuHardwareDetected to accept a pointer to Input
to match the refactor of evaluateGPUPresence and evaluateArchitecture: change
the signature from func gpuHardwareDetected(input Input) bool to func
gpuHardwareDetected(input *Input) bool, update its body to dereference or use
the pointer fields (input.GPUHardwarePresent and input.GPUInfo) accordingly, and
update all call sites (e.g., where evaluateGPUPresence/evaluateArchitecture now
pass *Input) to pass the same *Input to gpuHardwareDetected so parameter types
remain consistent across these helper functions.
- Around line 149-162: Change evaluateGPUPresence to accept *Input instead of
Input and update gpuHardwareDetected to accept *Input as well to avoid copying
the ~80-byte struct; inside evaluateGPUPresence use the pointer parameter for
the check and return the same Check values. Also update the call site in
Evaluate (where evaluateGPUPresence is invoked) to pass the address of the Input
(e.g., &input) so the new pointer signatures are used consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ef309110-5f36-415f-a1b5-130412c2ee0a
📒 Files selected for processing (2)
internal/precheck/precheck.gointernal/precheck/precheck_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/precheck/precheck_test.go
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/precheck/precheck.go`:
- Around line 266-275: The current detectGPUHardware hides probe failures by
returning false on any listPCIGPUs error; change detectGPUHardware to return
(bool, error) and propagate the underlying error from listPCIGPUs instead of
swallowing it (i.e., call devs, err := listPCIGPUs(ctx); if err != nil { return
false, err } ; return len(devs) > 0, nil). Then update callers (the Input
construction and evaluateGPUPresence flow) to accept the (bool, error) result
and treat an error as "unable to determine GPU presence" (surface the error
upstream) while only treating a nil error with empty slice as the real "no GPU"
case. Ensure references to detectGPUHardware, listPCIGPUs, and
evaluateGPUPresence are updated accordingly.
- Around line 137-143: Evaluate currently dereferences input.NVAttestPresent
while building the checks slice which panics when input is nil; fix by computing
a safe nvAttestPresent variable before constructing checks (e.g.,
nvAttestPresent := false; if input != nil { nvAttestPresent =
input.NVAttestPresent }) and then call evaluateNVAttest(nvAttestPresent) inside
Evaluate so the nil guard in per-check functions can run without the pre-check
panic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 27d3ef74-0cc3-4466-aa83-1aac1e69227c
📒 Files selected for processing (2)
internal/precheck/precheck.gointernal/precheck/precheck_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/precheck/precheck_test.go
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
Summary
Testing
go test ./internal/precheck ./cmd/fleetintRelated
Summary by CodeRabbit