[Bug]: NRI doesn't work trying to check nvidia-smi from the validation pod instead of host #2382

@frezbo

Describe the bug
When NRI is enabled via --set cdi.nriPluginEnabled=true, the toolkit validation pod fails because nvidia-smi is not found in PATH.
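
A minimal Go sketch of why this kind of failure shows up, assuming the validator resolves nvidia-smi through os/exec (an assumption; the helper name binaryMissing is hypothetical and not part of the GPU Operator). A binary that exists only on the host is not visible on the container's own PATH, so the lookup reports exec.ErrNotFound:

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// binaryMissing reports whether name cannot be resolved via the PATH of the
// current process. Hypothetical helper for illustration only.
func binaryMissing(name string) bool {
	_, err := exec.LookPath(name)
	// LookPath wraps exec.ErrNotFound when no matching executable is on PATH.
	return errors.Is(err, exec.ErrNotFound)
}

func main() {
	// In a container image that ships no driver binaries, this lookup is
	// expected to fail even though the host has a working nvidia-smi.
	fmt.Println("nvidia-smi missing:", binaryMissing("nvidia-smi"))
}
```

Running this inside the validation container and again on the host should make the difference between the two environments visible.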

Looking at the code, toolkit validation appears to execute nvidia-smi from inside the validation container, whereas driver validation runs it from the host system.

See:

func (t *Toolkit) validate() error {

This executes nvidia-smi from inside the validation container, which fails. For comparison, see driver validation:

func validateHostDriver(silent bool) error {

A fix would be for toolkit validation to check nvidia-smi the same way driver validation does, that is, by executing nvidia-smi from the host.
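
The proposed fix could be sketched roughly as below. This is not the operator's code: smiCommand and runSMI are hypothetical names, and the sketch assumes the host root filesystem is mounted into the validation container at /host and that the container is permitted to chroot. With an empty host root it falls back to the in-container lookup that currently fails.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// smiCommand returns the command and arguments used to invoke nvidia-smi.
// With an empty hostRoot the binary is resolved inside the container;
// otherwise the call is wrapped in chroot so it resolves on the host.
// Hypothetical helper for illustration only.
func smiCommand(hostRoot string) (string, []string) {
	if hostRoot == "" {
		return "nvidia-smi", nil
	}
	return "chroot", []string{hostRoot, "nvidia-smi"}
}

// runSMI executes nvidia-smi and streams its output to the caller's
// stdout and stderr.
func runSMI(hostRoot string) error {
	name, args := smiCommand(hostRoot)
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// "/host" is an assumed mount point for the host root filesystem.
	name, args := smiCommand("/host")
	fmt.Printf("would run: %s %v\n", name, args)
}
```

Calling runSMI("/host") from the toolkit validation path would then exercise the host's nvidia-smi rather than requiring the binary inside the validation image.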

To Reproduce
Install the Helm chart with --set cdi.nriPluginEnabled=true.

Expected behavior
The NRI plugin works and toolkit validation succeeds.

Environment (please provide the following information):

  • GPU Operator Version: v26.3.1
  • OS: Talos 1.13.0-rc.0
  • Kernel Version: 6.18.22-talos
  • Container Runtime Version: containerd 2.2.3
  • Kubernetes Distro and Version: Talos v1.35.2

Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs: journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com

Labels: bug, needs-triage