K8s ‐ Verify cluster operators available

Description

This rule verifies that every OpenShift cluster operator is Available, not Degraded, not stuck Progressing, and Upgradeable. Cluster operators are essential for the cluster to function properly: they manage core platform capabilities such as authentication, networking, storage, and the API server.

The rule retrieves all ClusterOperator resources via oc get clusteroperators -o json and checks each operator's conditions:

The Available condition must have status: "True" — fails if not
The Degraded condition must not have status: "True" — fails if degraded
The Progressing condition should not have status: "True" — warns if progressing
The Upgradeable condition should not have status: "False" — warns if not upgradeable

If any operator is unavailable or degraded, the rule reports it as failed. If operators are progressing or not upgradeable (but otherwise available and not degraded), the rule reports a warning.
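
The pass/warn/fail logic described above can be sketched as a small shell filter. This is a minimal illustration, not the rule's actual implementation: the function name classify_operators and the five-column input layout (name, Available, Degraded, Progressing, Upgradeable) are assumptions made for this example. A condition that is absent prints as <none> in custom-columns output, so a missing Available condition counts as a failure here, while a missing Upgradeable condition does not trigger a warning.

```shell
# classify_operators (hypothetical helper): reads rows of
#   NAME AVAILABLE DEGRADED PROGRESSING UPGRADEABLE
# on stdin and prints "<verdict> <name>" for each operator.
classify_operators() {
  awk '
    # Fail: Available is anything other than True, or Degraded is True
    $2 != "True" || $3 == "True"  { print "FAIL", $1; next }
    # Warn: Progressing is True, or Upgradeable is explicitly False
    $4 == "True" || $5 == "False" { print "WARN", $1; next }
    { print "PASS", $1 }
  '
}

# Assumed usage against a live cluster (not run here):
#   cols='NAME:.metadata.name'
#   for c in Available Degraded Progressing Upgradeable; do
#     cols="$cols,$c:.status.conditions[?(@.type==\"$c\")].status"
#   done
#   oc get clusteroperators --no-headers -o custom-columns="$cols" | classify_operators
```

Feeding the filter from custom-columns output keeps the sketch free of a JSON parser; the rule itself works from the -o json output.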

Prerequisites

Access to the OpenShift cluster with permissions to list ClusterOperator resources
The oc command-line tool configured and authenticated

Impact

If cluster operators are not in Available state:

Cluster functionality loss: Core platform features (authentication, networking, ingress, monitoring) may be partially or fully unavailable
Workload disruption: Applications relying on cluster services (e.g., image registry, DNS, storage) may fail
Upgrade blocking: Unavailable or degraded operators can block cluster upgrades; an operator reporting Upgradeable=False typically blocks minor-version upgrades while still allowing patch updates
Cascading failures: One unavailable operator can cause dependent operators to degrade

Root Cause

Common scenarios that may lead to unavailable or degraded cluster operators:

A failed cluster upgrade that left operators in a transitional state
Node failures or reboots that disrupted operator pods
Resource exhaustion (CPU, memory, disk) on control plane nodes
etcd cluster health issues affecting operator coordination
Certificate expiration preventing operator communication
Network connectivity issues between control plane components
Storage backend failures affecting operators that rely on persistent storage

Diagnostics

1. List all cluster operators and their status

oc get clusteroperators

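Look for operators with AVAILABLE=False, DEGRADED=True, or PROGRESSING=True. The default table does not include an Upgradeable column; check that condition in the detailed view from the next step.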

2. Get detailed status of a specific operator

oc describe clusteroperator <operator-name>

Look for:

Conditions section showing Available, Degraded, Progressing, and Upgradeable states
Recent events indicating failures or transitions
Version information for upgrade-related issues

3. Check operator pods

oc get pods -n openshift-<operator-name> -o wide

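Verify operator pods are running and ready on the expected nodes. The openshift-<operator-name> namespace is a common pattern, not a guarantee: some operators run in a differently named namespace (for example, one ending in -operator). The .status.relatedObjects list of the ClusterOperator resource names the namespaces the operator manages.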

4. Check operator logs

oc logs -n openshift-<operator-name> deployment/<operator-name> --tail=100

5. Check cluster events for operator-related issues

oc get events --all-namespaces --sort-by='.lastTimestamp' | grep -i operator

Solution

General approach

Identify the specific operator(s) that are unavailable, degraded, progressing, or not upgradeable from the rule output
Check the operator's conditions and events for the root cause
Address the underlying issue (see specific scenarios below)

If operator is stuck after an upgrade

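# List operators that are still progressing (prints nothing when none are)
oc get clusteroperators -o json | jq -r '.items[] | select(any(.status.conditions[]?; .type=="Progressing" and .status=="True")) | .metadata.name'

# Force operator reconciliation by deleting the operator pod
# (the label selector varies by operator; confirm it with: oc get pods -n openshift-<operator-name> --show-labels)
oc delete pod -n openshift-<operator-name> -l name=<operator-name>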

If operator is degraded due to node issues

# Check node status
oc get nodes

# Check if operator pods are scheduled on healthy nodes
oc get pods -n openshift-<operator-name> -o wide

If operator is unavailable due to certificate issues

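# List secrets in the operator namespace (names only; this does not show expiry dates)
oc get secret -n openshift-<operator-name> -o jsonpath='{.items[*].metadata.name}'

# Print the expiry date of a TLS secret's certificate (assumes the secret has a tls.crt key)
oc get secret <secret-name> -n openshift-<operator-name> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate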

# Approve pending CSRs if any
oc get csr | grep Pending
oc adm certificate approve <csr-name>
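
When many secrets need checking, a yes/no test against a threshold is easier to script than reading dates by eye. The following is a minimal sketch built on openssl x509 -checkend; the function name cert_expires_within is hypothetical, and it assumes the certificate arrives as PEM on stdin. Input that openssl cannot parse is reported as expiring, so that unreadable certificates are not silently skipped.

```shell
# cert_expires_within (hypothetical helper): returns 0 if the PEM certificate
# on stdin expires within the given number of days (or cannot be parsed),
# and 1 if it remains valid beyond that window.
cert_expires_within() {
  days="$1"
  # -checkend N exits 0 when the certificate is still valid N seconds from now
  if openssl x509 -noout -checkend "$((days * 86400))" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Assumed usage against a live cluster (not run here):
#   oc get secret <secret-name> -n openshift-<operator-name> -o jsonpath='{.data.tls\.crt}' \
#     | base64 -d | cert_expires_within 30 && echo "expires within 30 days"
```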

If operator is not upgradeable

# Check what is blocking the upgrade
oc get clusteroperator <operator-name> -o json | jq '.status.conditions[] | select(.type=="Upgradeable")'

# Review operator logs for upgrade blockers
oc logs -n openshift-<operator-name> deployment/<operator-name> --tail=200 | grep -i upgrade

Verify the fix

# Confirm all operators are Available and not Degraded
oc get clusteroperators

# Verify no operators are unavailable or Degraded (prints nothing when all are healthy)
oc get clusteroperators -o json | jq -r '.items[] | select(any(.status.conditions[]?; (.type=="Available" and .status!="True") or (.type=="Degraded" and .status=="True"))) | .metadata.name'

# Verify no operators are Progressing or not Upgradeable (prints nothing when none are)
oc get clusteroperators -o json | jq -r '.items[] | select(any(.status.conditions[]?; (.type=="Progressing" and .status=="True") or (.type=="Upgradeable" and .status=="False"))) | .metadata.name'
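
Operators often need a few minutes to settle after a fix, so a single check right away can be misleading. One possible approach is to poll until a check succeeds. The helper below is a generic sketch; the name retry_until and its argument order (attempts, delay in seconds, then the command to run) are assumptions made for illustration.

```shell
# retry_until (hypothetical helper): runs a command up to <attempts> times,
# sleeping <delay> seconds between tries. Returns 0 as soon as the command
# succeeds, or 1 if every attempt fails.
retry_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt is still to come
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Assumed usage against a live cluster (not run here): poll every 30 seconds,
# up to 20 times, until all operators report Available.
#   retry_until 20 30 oc wait clusteroperators --all --for=condition=Available=True --timeout=5s
```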
