
Operator is not resilient to a short API-Server downtime #2458

Open
@qjoly

Description


Similar to #2450 (but this time, both pods are stopped)

I found the same problem as last time. After a brief unavailability of the Kubernetes API-Server, the operator can no longer update its lease.
This leads to a total stop (without the pod restarting), in addition to abnormal resource consumption.

Please note that the two pods are on two different nodes. This micro-disruption in API-Server availability only affected one pod at a time (the first replica several days ago on node-1, the other replica recently on node-2). Since a liveness probe cannot be configured, the pods never restart and remain stuck in this state.

Expected Behavior

I expected the operator to retry until its request succeeds (or, better, to stop itself cleanly).

Current Behavior

Currently, the operator's two replicas are stopped and no longer respond to anything (creating or deleting a Tenant custom resource does nothing). On top of that, each pod consumes 1000m CPU, as if the code were stuck in a busy loop it never exits.

Possible Solution

Add a retry (or fix the problem that prevents the pod from stopping cleanly when the API-Server is unavailable).

Steps to Reproduce (for bugs)

  1. Deploy a cluster with a single control plane
  2. Install the MinIO Operator (ensure neither pod is scheduled on the control plane)
  3. Restart the control plane to cause a short API-Server downtime
  4. The operator is stuck

Your Environment
