
K8s ‐ Verify cluster operators available

Elyasaf Halle edited this page May 12, 2026 · 2 revisions

Description

This rule verifies that every OpenShift cluster operator is Available, not Degraded, not stuck Progressing, and Upgradeable. Cluster operators manage core platform capabilities such as authentication, networking, storage, and the API server, so the cluster cannot function properly without them.

The rule retrieves all ClusterOperator resources via oc get clusteroperators -o json and checks each operator's conditions:

  1. The Available condition must have status: "True" (fails if not)
  2. The Degraded condition must not have status: "True" (fails if degraded)
  3. The Progressing condition should not have status: "True" (warns if progressing)
  4. The Upgradeable condition should not have status: "False" (warns if not upgradeable)

If any operator is unavailable or degraded, the rule reports it as failed. If operators are progressing or not upgradeable (but otherwise available and not degraded), the rule reports a warning.
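The pass/warn/fail logic described above can be sketched as a small shell filter. This is only an illustration of the documented checks, not the rule's actual implementation; the function names co_rows and classify_operators and the row layout are assumptions made for this example.

```shell
# Hypothetical helper: print one row per operator as
#   NAME AVAILABLE DEGRADED PROGRESSING UPGRADEABLE
# oc prints <none> for a condition that is absent.
co_rows() {
  oc get clusteroperators --no-headers -o custom-columns='NAME:.metadata.name,AVAILABLE:.status.conditions[?(@.type=="Available")].status,DEGRADED:.status.conditions[?(@.type=="Degraded")].status,PROGRESSING:.status.conditions[?(@.type=="Progressing")].status,UPGRADEABLE:.status.conditions[?(@.type=="Upgradeable")].status'
}

# Hypothetical helper: read rows in the layout above from stdin and print
#   NAME VERDICT REASON
# Failures take precedence over warnings, matching the order of the checks.
classify_operators() {
  awk '
    NF == 0 { next }                      # skip blank lines
    {
      name = $1; avail = $2; degr = $3; prog = $4; upg = $5
      if (avail != "True")      print name, "FAIL", "not available"
      else if (degr == "True")  print name, "FAIL", "degraded"
      else if (prog == "True")  print name, "WARN", "progressing"
      else if (upg == "False")  print name, "WARN", "not upgradeable"
      else                      print name, "PASS", "-"
    }'
}
```

Against a live cluster the two helpers would be chained as co_rows | classify_operators. Keeping the classifier separate from the oc call makes it possible to check the logic with sample rows before pointing it at a cluster.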

Prerequisites

  • Access to the OpenShift cluster with permissions to list ClusterOperator resources
  • The oc command-line tool configured and authenticated

Impact

If cluster operators are not in Available state:

  • Cluster functionality loss: Core platform features (authentication, networking, ingress, monitoring) may be partially or fully unavailable
  • Workload disruption: Applications relying on cluster services (e.g., image registry, DNS, storage) may fail
  • Upgrade blocking: Unavailable or degraded operators can block cluster upgrades; an operator reporting Upgradeable=False typically blocks minor-version upgrades but not patch updates
  • Cascading failures: One unavailable operator can cause dependent operators to degrade

Root Cause

Common scenarios that may lead to unavailable or degraded cluster operators:

  • A failed cluster upgrade that left operators in a transitional state
  • Node failures or reboots that disrupted operator pods
  • Resource exhaustion (CPU, memory, disk) on control plane nodes
  • etcd cluster health issues affecting operator coordination
  • Certificate expiration preventing operator communication
  • Network connectivity issues between control plane components
  • Storage backend failures affecting operators that rely on persistent storage

Diagnostics

1. List all cluster operators and their status

oc get clusteroperators

Look for operators with Available=False, Degraded=True, or Progressing=True. The default table has no Upgradeable column, so use step 2 to check for Upgradeable=False.

2. Get detailed status of a specific operator

oc describe clusteroperator <operator-name>

Look for:

  • Conditions section showing Available, Degraded, Progressing, and Upgradeable states
  • Recent events indicating failures or transitions
  • Version information for upgrade-related issues

3. Check operator pods

oc get pods -n openshift-<operator-name> -o wide

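Verify operator pods are running and ready on the expected nodes. Namespace and workload names vary by operator and do not always match openshift-<operator-name>; the operator's status.relatedObjects field lists the namespaces it manages.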

4. Check operator logs

oc logs -n openshift-<operator-name> deployment/<operator-name> --tail=100

5. Check cluster events for operator-related issues

oc get events --all-namespaces --sort-by='.lastTimestamp' | grep -i operator
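Matching on the word "operator" can miss events from operand pods and can include routine Normal events. As a hedged alternative, the sketch below keeps only Warning events from openshift-* namespaces. The function name warning_events is hypothetical, and the filter assumes the column layout printed by oc get events --all-namespaces --no-headers (NAMESPACE, LAST SEEN, TYPE, REASON, OBJECT, MESSAGE).

```shell
# Hypothetical helper: read event rows from stdin and keep Warning events
# whose namespace starts with "openshift-".
# Column 1 is the namespace and column 3 is the event type.
warning_events() {
  awk '$1 ~ /^openshift-/ && $3 == "Warning"'
}
```

Typical use would be oc get events --all-namespaces --no-headers --sort-by='.lastTimestamp' | warning_events. Adding --field-selector type=Warning to the oc command does the type filtering on the server side instead.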

Solution

General approach

  1. Identify the specific operator(s) that are unavailable, degraded, progressing, or not upgradeable from the rule output
  2. Check the operator's conditions and events for the root cause
  3. Address the underlying issue (see specific scenarios below)

If operator is stuck after an upgrade

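# List operators that are still progressing
oc get clusteroperators -o json | jq -r '.items[] | select(any(.status.conditions[]; .type=="Progressing" and .status=="True")) | .metadata.name'

# Force operator reconciliation by deleting the operator pod
# (the label selector varies by operator; confirm it with: oc get pods -n openshift-<operator-name> --show-labels)
oc delete pod -n openshift-<operator-name> -l name=<operator-name>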

If operator is degraded due to node issues

# Check node status
oc get nodes

# Check if operator pods are scheduled on healthy nodes
oc get pods -n openshift-<operator-name> -o wide
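On clusters with many nodes it can help to print only the nodes that need attention. The following is a minimal sketch; the function name unhealthy_nodes is hypothetical, and it assumes the default oc get nodes --no-headers layout, where the second column is STATUS.

```shell
# Hypothetical helper: read node rows from stdin and print the name and
# status of every node whose STATUS is not exactly "Ready".
# This also reports cordoned nodes (Ready,SchedulingDisabled), because
# operator pods cannot be scheduled onto them.
unhealthy_nodes() {
  awk 'NF >= 2 && $2 != "Ready" { print $1, $2 }'
}
```

Typical use would be oc get nodes --no-headers | unhealthy_nodes. No output means every node reports Ready and is schedulable.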

If operator is unavailable due to certificate issues

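# List the secrets in the operator namespace (names only)
oc get secret -n openshift-<operator-name> -o jsonpath='{.items[*].metadata.name}'

# Check the expiration date of the certificate stored in a TLS secret
oc get secret <secret-name> -n openshift-<operator-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate

# List pending CSRs, review them, then approve the legitimate ones
oc get csr | grep Pending
oc adm certificate approve <csr-name>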
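When several CSRs are pending, extracting just their names makes the review and approval step easier to script. The sketch below is one way to do that; the function name pending_csrs is hypothetical, and it assumes the CONDITION value is the last column of oc get csr --no-headers output.

```shell
# Hypothetical helper: read CSR rows from stdin and print the names of
# requests whose condition is exactly "Pending".
# Rows such as "Approved,Issued" or "Denied" are ignored.
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}
```

Typical use would be oc get csr --no-headers | pending_csrs. Review the requestor and signer of each listed CSR before passing its name to oc adm certificate approve; approving requests without review can admit an untrusted node.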

If operator is not upgradeable

# Check what is blocking the upgrade
oc get clusteroperator <operator-name> -o json | jq '.status.conditions[] | select(.type=="Upgradeable")'

# Review operator logs for upgrade blockers
oc logs -n openshift-<operator-name> deployment/<operator-name> --tail=200 | grep -i upgrade

Verify the fix

# Review the overall status of all operators
oc get clusteroperators

# List operators that are unavailable or Degraded (expect no output)
oc get clusteroperators -o json | jq -r '.items[] | select(any(.status.conditions[]; (.type=="Available" and .status!="True") or (.type=="Degraded" and .status=="True"))) | .metadata.name'

# List operators that are Progressing or not Upgradeable (expect no output)
oc get clusteroperators -o json | jq -r '.items[] | select(any(.status.conditions[]; (.type=="Progressing" and .status=="True") or (.type=="Upgradeable" and .status=="False"))) | .metadata.name'
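Operators often need a few minutes to settle after a fix, so a single check can be misleading. The following sketch polls until every operator meets the rule's criteria or a retry limit is reached. The function names (co_rows, all_healthy, wait_until_healthy) and the row layout are assumptions made for this example, not part of the rule.

```shell
# Hypothetical helper: print one row per operator as
#   NAME AVAILABLE DEGRADED PROGRESSING UPGRADEABLE
co_rows() {
  oc get clusteroperators --no-headers -o custom-columns='NAME:.metadata.name,AVAILABLE:.status.conditions[?(@.type=="Available")].status,DEGRADED:.status.conditions[?(@.type=="Degraded")].status,PROGRESSING:.status.conditions[?(@.type=="Progressing")].status,UPGRADEABLE:.status.conditions[?(@.type=="Upgradeable")].status'
}

# Return 0 only if stdin has at least one row and every row is Available,
# not Degraded, not Progressing, and not Upgradeable=False.
# Empty input counts as unhealthy so a failed oc call is not mistaken for success.
all_healthy() {
  awk '
    NF == 0 { next }
    { n++ }
    $2 != "True" || $3 == "True" || $4 == "True" || $5 == "False" { bad = 1 }
    END { exit ((bad || n == 0) ? 1 : 0) }'
}

# Usage: wait_until_healthy ATTEMPTS DELAY_SECONDS ROWS_COMMAND [ARGS...]
# Runs ROWS_COMMAND up to ATTEMPTS times, sleeping DELAY_SECONDS between tries.
wait_until_healthy() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" | all_healthy; then
      echo "all operators healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "operators still unhealthy after $attempts attempt(s)" >&2
  return 1
}
```

For example, wait_until_healthy 20 30 co_rows would check every 30 seconds for up to 10 minutes. Passing the row-producing command as an argument keeps the polling loop independent of oc, so it can be exercised with canned rows first.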
