GPU Operator reconciliation loop failed #481

@arpitsharma-vw

Description

Hello Team,
We are seeing the error "GPU Operator reconciliation loop failed" in both of our OpenShift clusters.

Exact error: GPU Operator reconciliation loop failed for more than 1h; some of its DaemonSet operands might be unable to deploy on some of the GPU-enabled nodes.

We have found that the pod nvidia-device-plugin-validator-xxxx is failing. In the pod's events we see: "Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected"

We see that the init container plugin-validation is restarting again and again. The other init containers in the pod nvidia-operator-validator-xxx work fine; the only problem is with plugin-validation.

Note: there is currently one GPU node in the cluster, with a single GPU. A user pod requesting 1 GPU is already scheduled on that node and is running fine. Since the plugin-validation init container also requests 1 GPU, this may be the cause. We have seen this issue several times in both clusters.
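The event message matches how Kubernetes accounts for extended resources such as nvidia.com/gpu: they are integer-counted, so a node's available count is its allocatable count minus the sum of requests from pods already running on it. A minimal sketch of that accounting, with hypothetical numbers taken from the situation above (1 allocatable GPU, 1 GPU already requested by a user pod):

```python
def gpus_available(allocatable: int, requested_by_running_pods: int) -> int:
    # Extended resources like nvidia.com/gpu are whole-number counted:
    # available = allocatable - sum of requests from pods bound to the node.
    return allocatable - requested_by_running_pods

# One GPU node exposing a single GPU; a user pod already requests it.
available = gpus_available(allocatable=1, requested_by_running_pods=1)

# The plugin-validation init container requests 1 more GPU,
# so allocation fails: "Requested: 1, Available: 0".
validator_request = 1
can_schedule = validator_request <= available  # False
```

Under this reading the validator cannot get a GPU until the user pod releases it, which would explain why only the plugin-validation step keeps restarting.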

Can you help here?

Other details:
OpenShift Version: 4.11.22
GPU Node Type: g4dn.2xlarge
