Description
Similar to #2450 (but this time, both pods are stopped)
I ran into the same problem as last time: after a brief period of Kubernetes API-Server unavailability, the operator can no longer renew its lease.
This leads to a complete stop of the operator (without the pods restarting), combined with abnormal resource consumption.
Please note that the two pods run on two different nodes, and this micro-disruption in API-Server availability only affected one pod at a time (the first replica several days ago on node-1, the other replica recently on node-2). Since a liveness probe cannot be configured, the pods never restart and stay stuck in this state.
Expected Behavior
I expected the operator to retry until the request succeeds (or, better, to shut itself down cleanly), as sketched below.
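A minimal sketch of the kind of retry I had in mind, assuming the operator updates a `coordination.k8s.io` Lease through client-go (the function name `renewLeaseWithRetry` and the error filter are mine, not the operator's actual code):

```go
package sketch

import (
	"context"
	"time"

	coordv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// renewLeaseWithRetry retries the lease update with exponential backoff
// while the API server is unreachable, instead of failing once and wedging.
func renewLeaseWithRetry(ctx context.Context, client kubernetes.Interface, lease *coordv1.Lease) error {
	backoff := wait.Backoff{Duration: time.Second, Factor: 2.0, Jitter: 0.1, Steps: 5}
	retriable := func(err error) bool {
		// Only retry errors typical of a transient API server outage.
		return utilnet.IsConnectionRefused(err) ||
			apierrors.IsServerTimeout(err) ||
			apierrors.IsServiceUnavailable(err) ||
			apierrors.IsTooManyRequests(err)
	}
	return retry.OnError(backoff, retriable, func() error {
		_, err := client.CoordinationV1().Leases(lease.Namespace).Update(ctx, lease, metav1.UpdateOptions{})
		return err
	})
}
```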
Current Behavior
Currently, both operator replicas are stopped and no longer respond to anything (creating or deleting a Tenant CR does nothing). On top of that, each pod consumes a full 1000m of CPU, as if the code were spinning in a loop it never exits.
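For comparison, the observed 1000m CPU is consistent with a retry loop that has no backoff. Purely illustrative, not the operator's actual code:

```go
package main

import (
	"errors"
	"time"
)

// renewLease stands in for the operator's lease update. While the API
// server is down, every call fails immediately with "connection refused".
func renewLease() error {
	return errors.New("connection refused")
}

func main() {
	for {
		if err := renewLease(); err != nil {
			continue // no sleep, no backoff: busy-waits and pegs a full core
		}
		time.Sleep(2 * time.Second)
	}
}
```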
Possible Solution
Add a retry with backoff, or fix the problem where the pod cannot stop itself when the API-Server is unavailable (see the sketch below).
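For the second option, the usual client-go pattern is to terminate the process in `OnStoppedLeading` so the kubelet restarts the pod. A sketch assuming leader election via `k8s.io/client-go/tools/leaderelection`; the lock name/namespace and `runOperator` are placeholders, not the operator's real identifiers:

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

// runOperator is a placeholder for the operator's controller loops.
func runOperator(ctx context.Context) {
	<-ctx.Done()
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatalf("building in-cluster config: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{Name: "minio-operator-lock", Namespace: "minio-operator"},
		Client:    client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: os.Getenv("HOSTNAME"), // pod name as leader identity
		},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second, // renewal is retried every RetryPeriod until this deadline
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: runOperator,
			OnStoppedLeading: func() {
				// Exit instead of lingering: the kubelet restarts the pod,
				// and election resumes cleanly once the API server is back.
				klog.Fatal("lost lease, exiting so the pod restarts")
			},
		},
	})
}
```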
Steps to Reproduce (for bugs)
- Deploy a cluster with a single control plane node
- Install the MinIO Operator (ensure neither operator pod is scheduled on the control plane)
- Restart the control plane node to cause API server downtime
- The operator is stuck
Your Environment
- Version used (minio-operator): quay.io/minio/operator:v7.0.0
- Environment name and version (e.g. kubernetes v1.17.2): v1.32.2
- Server type and version: Talos 1.9.3 (VM)
- Link to your deployment file: https://gist.github.com/qjoly/b96a1509d130d3902ef4957e8dba8d85