1. Quick Debug Information
We deployed the gpu-operator 24.3.0 Helm chart on Kubernetes.
- RHEL 8.9
- Kernel Version: 4.18.0-513.18.1.el8_9.x86_64
- Container Runtime Type: CRI-O 1.27.4
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): k8s
- GPU Operator Version: 24.3.0
2. Issue or feature description
The pod "gpu-operator" crashes with below complete log
{"level":"info","ts":1718417750.9473174,"logger":"controllers.Upgrade","msg":"Upgrades in progress","currently in progress":0,"max parallel upgrades":1,"upgrade slots available":1,"currently unavailable nodes":0,"total number of nodes":2,"maximum nodes that can be unavailable":1}
{"level":"info","ts":1718417750.9473245,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1718417750.9473295,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1718417750.9629183,"logger":"controllers.Upgrade","msg":"ProcessUpgradeRequiredNodes"}
{"level":"info","ts":1718417750.9629385,"logger":"controllers.Upgrade","msg":"ProcessCordonRequiredNodes"}
{"level":"info","ts":1718417750.9629438,"logger":"controllers.Upgrade","msg":"ProcessWaitForJobsRequiredNodes"}
{"level":"info","ts":1718417750.9629483,"logger":"controllers.Upgrade","msg":"ProcessPodDeletionRequiredNodes"}
{"level":"info","ts":1718417750.9629524,"logger":"controllers.Upgrade","msg":"ProcessDrainNodes"}
{"level":"info","ts":1718417750.9629562,"logger":"controllers.Upgrade","msg":"Node drain is disabled by policy, skipping this step"}
{"level":"info","ts":1718417750.962961,"logger":"controllers.Upgrade","msg":"ProcessPodRestartNodes"}
{"level":"info","ts":1718417750.9629655,"logger":"controllers.Upgrade","msg":"Starting Pod Delete"}
{"level":"info","ts":1718417750.9629693,"logger":"controllers.Upgrade","msg":"No pods scheduled to restart"}
{"level":"info","ts":1718417750.9629734,"logger":"controllers.Upgrade","msg":"ProcessUpgradeFailedNodes"}
{"level":"info","ts":1718417750.9629772,"logger":"controllers.Upgrade","msg":"ProcessValidationRequiredNodes"}
{"level":"info","ts":1718417750.9629812,"logger":"controllers.Upgrade","msg":"ProcessUncordonRequiredNodes"}
{"level":"info","ts":1718417750.9629848,"logger":"controllers.Upgrade","msg":"State Manager, finished processing"}
{"level":"info","ts":1718417870.9635167,"logger":"controllers.Upgrade","msg":"Reconciling Upgrade","upgrade":{"name":"cluster-policy"}}
{"level":"info","ts":1718417870.9635983,"logger":"controllers.Upgrade","msg":"Using label selector","upgrade":{"name":"cluster-policy"},"key":"app","value":"nvidia-driver-daemonset"}
{"level":"info","ts":1718417870.963607,"logger":"controllers.Upgrade","msg":"Building state"}
{"level":"info","ts":1718417870.9777894,"logger":"controllers.Upgrade","msg":"Pod","pod":"nvidia-driver-daemonset-dhmbq","owner":"nvidia-driver-daemonset"}
{"level":"info","ts":1718417870.9778109,"logger":"controllers.Upgrade","msg":"Pod","pod":"nvidia-driver-daemonset-pcb8s","owner":"nvidia-driver-daemonset"}
{"level":"info","ts":1718417870.9778173,"logger":"controllers.Upgrade","msg":"Total orphaned Pods found:","count":0}
{"level":"info","ts":1718417870.983097,"logger":"controllers.Upgrade","msg":"Node hosting a driver pod","node":"**REDRACTED**","state":"upgrade-done"}
{"level":"info","ts":1718417870.988065,"logger":"controllers.Upgrade","msg":"Node hosting a driver pod","node":"**REDRACTED**","state":"upgrade-done"}
{"level":"info","ts":1718417870.9880915,"logger":"controllers.Upgrade","msg":"Propagate state to state manager","upgrade":{"name":"cluster-policy"}}
{"level":"info","ts":1718417870.988102,"logger":"controllers.Upgrade","msg":"State Manager, got state update"}
{"level":"info","ts":1718417870.988108,"logger":"controllers.Upgrade","msg":"Node states:","Unknown":0,"upgrade-done":2,"upgrade-required":0,"cordon-required":0,"wait-for-jobs-required":0,"pod-deletion-required":0,"upgrade-failed":0,"drain-required":0,"pod-restart-required":0,"validation-required":0,"uncordon-required":0}
{"level":"info","ts":1718417870.9881191,"logger":"controllers.Upgrade","msg":"Upgrades in progress","currently in progress":0,"max parallel upgrades":1,"upgrade slots available":1,"currently unavailable nodes":0,"total number of nodes":2,"maximum nodes that can be unavailable":1}
{"level":"info","ts":1718417870.9881265,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1718417870.9881313,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1718417871.005991,"logger":"controllers.Upgrade","msg":"ProcessUpgradeRequiredNodes"}
{"level":"info","ts":1718417871.0060086,"logger":"controllers.Upgrade","msg":"ProcessCordonRequiredNodes"}
{"level":"info","ts":1718417871.0060134,"logger":"controllers.Upgrade","msg":"ProcessWaitForJobsRequiredNodes"}
{"level":"info","ts":1718417871.0060184,"logger":"controllers.Upgrade","msg":"ProcessPodDeletionRequiredNodes"}
{"level":"info","ts":1718417871.0060227,"logger":"controllers.Upgrade","msg":"ProcessDrainNodes"}
{"level":"info","ts":1718417871.0060265,"logger":"controllers.Upgrade","msg":"Node drain is disabled by policy, skipping this step"}
{"level":"info","ts":1718417871.0060308,"logger":"controllers.Upgrade","msg":"ProcessPodRestartNodes"}
{"level":"info","ts":1718417871.006035,"logger":"controllers.Upgrade","msg":"Starting Pod Delete"}
{"level":"info","ts":1718417871.0060391,"logger":"controllers.Upgrade","msg":"No pods scheduled to restart"}
{"level":"info","ts":1718417871.006043,"logger":"controllers.Upgrade","msg":"ProcessUpgradeFailedNodes"}
{"level":"info","ts":1718417871.0060468,"logger":"controllers.Upgrade","msg":"ProcessValidationRequiredNodes"}
{"level":"info","ts":1718417871.0060506,"logger":"controllers.Upgrade","msg":"ProcessUncordonRequiredNodes"}
{"level":"info","ts":1718417871.0060544,"logger":"controllers.Upgrade","msg":"State Manager, finished processing"}
E0615 02:18:57.955670 1 leaderelection.go:332] error retrieving resource lock gpu-operator/53822513.nvidia.com: Get "https://10.9.240.1:443/apis/coordination.k8s.io/v1/namespaces/gpu-operator/leases/53822513.nvidia.com": context deadline exceeded
I0615 02:18:57.955723 1 leaderelection.go:285] failed to renew lease gpu-operator/53822513.nvidia.com: timed out waiting for the condition
{"level":"error","ts":1718417937.9557931,"logger":"setup","msg":"problem running manager","error":"leader election lost"}
3. Steps to reproduce the issue
The gpu-operator pod is expected to be always online; however, we received a Prometheus alert indicating that the "gpu-operator" pod had restarted.