GPU Operator reconciliation loop failed #481

@arpitsharma-vw

Description

Hello Team,
We are seeing the error "GPU Operator reconciliation loop failed" in both of our OpenShift clusters.

Exact error: GPU Operator reconciliation loop failed for more than 1h; some of its DaemonSet operands might be unable to deploy on some of the GPU-enabled nodes.

We have found that the pod nvidia-device-plugin-validator-xxxx is failing. In the pod's events we see: "Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected"

We see that the init container plugin-validation is restarting again and again. The other init containers in the pod nvidia-operator-validator-xxx work fine; the only problem is with plugin-validation.

Note: there is currently one GPU node in the cluster, with a single GPU. A user pod requesting 1 GPU is already scheduled on that node and is running fine. Since the plugin-validation init container also requests 1 GPU, this may be the cause. We have seen this issue several times in both clusters.
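The event message matches how Kubernetes accounts for extended resources such as nvidia.com/gpu: they are integer-counted, so a node's available count is its allocatable count minus the sum of requests from pods already running on it. A minimal sketch of that accounting, with hypothetical numbers taken from the situation above (1 allocatable GPU, 1 GPU already requested by a user pod):

```python
def gpus_available(allocatable: int, requested_by_running_pods: int) -> int:
    # Extended resources like nvidia.com/gpu are whole-number counted:
    # available = allocatable - sum of requests from pods bound to the node.
    return allocatable - requested_by_running_pods

# One GPU node exposing a single GPU; a user pod already requests it.
available = gpus_available(allocatable=1, requested_by_running_pods=1)

# The plugin-validation init container requests 1 more GPU,
# so allocation fails: "Requested: 1, Available: 0".
validator_request = 1
can_schedule = validator_request <= available  # False
```

Under this reading the validator cannot get a GPU until the user pod releases it, which would explain why only the plugin-validation step keeps restarting.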

Can you help here?

Other details:
OpenShift Version: 4.11.22
GPU Node Type: g4dn.2xlarge
