
"leader election lost" gpu-operator pod restarts #772

@alnhk

Description

1. Quick Debug Information

We deployed the gpu-operator 24.3.0 Helm chart on Kubernetes.

  • OS: RHEL 8.9
  • Kernel Version: 4.18.0-513.18.1.el8_9.x86_64
  • Container Runtime Type: CRI-O 1.27.4
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: 24.3.0

2. Issue or feature description

The pod "gpu-operator" crashes with below complete log

{"level":"info","ts":1718417750.9473174,"logger":"controllers.Upgrade","msg":"Upgrades in progress","currently in progress":0,"max parallel upgrades":1,"upgrade slots available":1,"currently unavailable nodes":0,"total number of nodes":2,"maximum nodes that can be unavailable":1}
{"level":"info","ts":1718417750.9473245,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1718417750.9473295,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1718417750.9629183,"logger":"controllers.Upgrade","msg":"ProcessUpgradeRequiredNodes"}
{"level":"info","ts":1718417750.9629385,"logger":"controllers.Upgrade","msg":"ProcessCordonRequiredNodes"}
{"level":"info","ts":1718417750.9629438,"logger":"controllers.Upgrade","msg":"ProcessWaitForJobsRequiredNodes"}
{"level":"info","ts":1718417750.9629483,"logger":"controllers.Upgrade","msg":"ProcessPodDeletionRequiredNodes"}
{"level":"info","ts":1718417750.9629524,"logger":"controllers.Upgrade","msg":"ProcessDrainNodes"}
{"level":"info","ts":1718417750.9629562,"logger":"controllers.Upgrade","msg":"Node drain is disabled by policy, skipping this step"}
{"level":"info","ts":1718417750.962961,"logger":"controllers.Upgrade","msg":"ProcessPodRestartNodes"}
{"level":"info","ts":1718417750.9629655,"logger":"controllers.Upgrade","msg":"Starting Pod Delete"}
{"level":"info","ts":1718417750.9629693,"logger":"controllers.Upgrade","msg":"No pods scheduled to restart"}
{"level":"info","ts":1718417750.9629734,"logger":"controllers.Upgrade","msg":"ProcessUpgradeFailedNodes"}
{"level":"info","ts":1718417750.9629772,"logger":"controllers.Upgrade","msg":"ProcessValidationRequiredNodes"}
{"level":"info","ts":1718417750.9629812,"logger":"controllers.Upgrade","msg":"ProcessUncordonRequiredNodes"}
{"level":"info","ts":1718417750.9629848,"logger":"controllers.Upgrade","msg":"State Manager, finished processing"}
{"level":"info","ts":1718417870.9635167,"logger":"controllers.Upgrade","msg":"Reconciling Upgrade","upgrade":{"name":"cluster-policy"}}
{"level":"info","ts":1718417870.9635983,"logger":"controllers.Upgrade","msg":"Using label selector","upgrade":{"name":"cluster-policy"},"key":"app","value":"nvidia-driver-daemonset"}
{"level":"info","ts":1718417870.963607,"logger":"controllers.Upgrade","msg":"Building state"}
{"level":"info","ts":1718417870.9777894,"logger":"controllers.Upgrade","msg":"Pod","pod":"nvidia-driver-daemonset-dhmbq","owner":"nvidia-driver-daemonset"}
{"level":"info","ts":1718417870.9778109,"logger":"controllers.Upgrade","msg":"Pod","pod":"nvidia-driver-daemonset-pcb8s","owner":"nvidia-driver-daemonset"}
{"level":"info","ts":1718417870.9778173,"logger":"controllers.Upgrade","msg":"Total orphaned Pods found:","count":0}
{"level":"info","ts":1718417870.983097,"logger":"controllers.Upgrade","msg":"Node hosting a driver pod","node":"**REDRACTED**","state":"upgrade-done"}
{"level":"info","ts":1718417870.988065,"logger":"controllers.Upgrade","msg":"Node hosting a driver pod","node":"**REDRACTED**","state":"upgrade-done"}
{"level":"info","ts":1718417870.9880915,"logger":"controllers.Upgrade","msg":"Propagate state to state manager","upgrade":{"name":"cluster-policy"}}
{"level":"info","ts":1718417870.988102,"logger":"controllers.Upgrade","msg":"State Manager, got state update"}
{"level":"info","ts":1718417870.988108,"logger":"controllers.Upgrade","msg":"Node states:","Unknown":0,"upgrade-done":2,"upgrade-required":0,"cordon-required":0,"wait-for-jobs-required":0,"pod-deletion-required":0,"upgrade-failed":0,"drain-required":0,"pod-restart-required":0,"validation-required":0,"uncordon-required":0}
{"level":"info","ts":1718417870.9881191,"logger":"controllers.Upgrade","msg":"Upgrades in progress","currently in progress":0,"max parallel upgrades":1,"upgrade slots available":1,"currently unavailable nodes":0,"total number of nodes":2,"maximum nodes that can be unavailable":1}
{"level":"info","ts":1718417870.9881265,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1718417870.9881313,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1718417871.005991,"logger":"controllers.Upgrade","msg":"ProcessUpgradeRequiredNodes"}
{"level":"info","ts":1718417871.0060086,"logger":"controllers.Upgrade","msg":"ProcessCordonRequiredNodes"}
{"level":"info","ts":1718417871.0060134,"logger":"controllers.Upgrade","msg":"ProcessWaitForJobsRequiredNodes"}
{"level":"info","ts":1718417871.0060184,"logger":"controllers.Upgrade","msg":"ProcessPodDeletionRequiredNodes"}
{"level":"info","ts":1718417871.0060227,"logger":"controllers.Upgrade","msg":"ProcessDrainNodes"}
{"level":"info","ts":1718417871.0060265,"logger":"controllers.Upgrade","msg":"Node drain is disabled by policy, skipping this step"}
{"level":"info","ts":1718417871.0060308,"logger":"controllers.Upgrade","msg":"ProcessPodRestartNodes"}
{"level":"info","ts":1718417871.006035,"logger":"controllers.Upgrade","msg":"Starting Pod Delete"}
{"level":"info","ts":1718417871.0060391,"logger":"controllers.Upgrade","msg":"No pods scheduled to restart"}
{"level":"info","ts":1718417871.006043,"logger":"controllers.Upgrade","msg":"ProcessUpgradeFailedNodes"}
{"level":"info","ts":1718417871.0060468,"logger":"controllers.Upgrade","msg":"ProcessValidationRequiredNodes"}
{"level":"info","ts":1718417871.0060506,"logger":"controllers.Upgrade","msg":"ProcessUncordonRequiredNodes"}
{"level":"info","ts":1718417871.0060544,"logger":"controllers.Upgrade","msg":"State Manager, finished processing"}
E0615 02:18:57.955670       1 leaderelection.go:332] error retrieving resource lock gpu-operator/53822513.nvidia.com: Get "https://10.9.240.1:443/apis/coordination.k8s.io/v1/namespaces/gpu-operator/leases/53822513.nvidia.com": context deadline exceeded
I0615 02:18:57.955723       1 leaderelection.go:285] failed to renew lease gpu-operator/53822513.nvidia.com: timed out waiting for the condition
{"level":"error","ts":1718417937.9557931,"logger":"setup","msg":"problem running manager","error":"leader election lost"}

3. Steps to reproduce the issue

There are no deliberate steps: the gpu-operator pod is expected to stay online at all times, but we received a Prometheus alert stating that the "gpu-operator" pod had restarted.
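
One way to confirm the restarts the alert reported, independent of Prometheus, is to read container restart counts straight from the Kubernetes API. The sketch below is a generic, hypothetical client-go example, not part of the gpu-operator; only the "gpu-operator" namespace is taken from this issue, and the kubeconfig handling and output format are assumptions.

// Hypothetical helper: print restart counts (and the last termination state)
// for all pods in the gpu-operator namespace.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig at the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	pods, err := clientset.CoreV1().Pods("gpu-operator").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			// RestartCount increments each time the kubelet restarts the container,
			// e.g. after the "leader election lost" exit shown in the log above.
			fmt.Printf("%s/%s restarts=%d\n", pod.Name, cs.Name, cs.RestartCount)
			if t := cs.LastTerminationState.Terminated; t != nil {
				fmt.Printf("  last termination: reason=%s exitCode=%d\n", t.Reason, t.ExitCode)
			}
		}
	}
}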
