
K8s ‐ Verify deployments availability

yogeshahiray edited this page Apr 16, 2026 · 1 revision

K8s - Verify all deployments are available

Description

This rule checks that every deployment in every namespace is available. A deployment is flagged as unavailable if it has no Available condition or if that condition's status is anything other than True.

The rule queries all deployment objects in the cluster and examines their status conditions to identify any deployments that are not in a healthy, available state.
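The check described above can be approximated from the command line. This is a minimal sketch, not the rule's actual implementation: the function name `list_unavailable_deployments` is hypothetical, and it assumes `oc` is logged in with permission to list deployments in all namespaces.

```shell
# Hypothetical helper: print "namespace/name" for every deployment whose
# Available condition is missing or has a status other than True.
list_unavailable_deployments() {
  oc get deployments --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Available")].status}{"\n"}{end}' |
    awk -F'\t' 'NF >= 2 && $3 != "True" { print $1 "/" $2 }'
}
```

Running `list_unavailable_deployments` with no arguments prints one line per flagged deployment. Empty output suggests that every deployment reports Available=True.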

Prerequisites

  • Access to the OpenShift cluster with permissions to list deployments across all namespaces
  • The oc command-line tool configured and authenticated

Impact

Unavailable deployments can lead to:

  • Application downtime: Services may become inaccessible to users
  • Degraded performance: Reduced capacity if some replicas are unavailable
  • Loss of redundancy: High availability guarantees may be compromised

Root Cause

Common scenarios that may cause deployments to become unavailable include:

  1. Resource constraints

    • Insufficient CPU or memory resources on nodes
    • Resource quota limits exceeded in the namespace
    • Pod eviction due to resource pressure
  2. Image-related issues

    • Image pull failures (invalid image name, authentication issues, registry unavailable)
    • Missing or deleted container images
    • Incompatible image architecture
  3. Configuration errors

    • Invalid environment variables or configuration maps
    • Missing secrets required by the deployment
    • Incorrect volume mount configurations
    • Invalid container command or arguments
  4. Pod failures

    • Application crashes or startup failures
    • Failed readiness or liveness probes
    • Init container failures
    • Persistent volume claim (PVC) binding issues
  5. Node issues

    • Node failures or evictions
    • Taints/tolerations preventing pod scheduling
    • Affinity/anti-affinity rules blocking placement
  6. Network problems

    • DNS resolution failures
    • Network policy blocking traffic
    • Service connectivity issues
  7. Rollout issues

    • Failed deployment updates
    • Rollout stuck in progress
    • Insufficient replicas during rolling update
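As a rough aid for mapping what a pod reports onto the categories above, the sketch below classifies common status reasons. The function names and the mapping are assumptions for illustration only. `triage_pods` looks only at the waiting reason of each pod's containers, so it misses causes that appear elsewhere, such as scheduling conditions.

```shell
# Hypothetical helper: map a container or pod status reason to one of the
# root-cause categories listed above. The mapping is a starting point only.
classify_reason() {
  case "$1" in
    ErrImagePull|ImagePullBackOff|InvalidImageName) echo "image-related issue" ;;
    CreateContainerConfigError) echo "configuration error" ;;
    CrashLoopBackOff|RunContainerError) echo "pod failure" ;;
    OOMKilled|Evicted) echo "resource constraint" ;;
    Unschedulable) echo "resource constraint or node issue" ;;
    *) echo "unclassified" ;;
  esac
}

# Print pod name, first waiting reason, and suggested category for every pod
# in the given namespace that has a waiting container.
triage_pods() {
  oc get pods -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' |
    while IFS="$(printf '\t')" read -r pod reasons; do
      [ -n "$reasons" ] || continue          # no container is waiting
      first=${reasons%% *}                   # keep the first reason only
      printf '%s\t%s\t%s\n' "$pod" "$first" "$(classify_reason "$first")"
    done
}
```

For example, `triage_pods <namespace>` gives a quick first guess at which category to investigate before working through the diagnostics.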

Diagnostics

1. Identify unavailable deployments

# List all deployments with their availability status
oc get deployments --all-namespaces

# Show deployments whose Available condition is missing or not True (requires jq)
oc get deployments --all-namespaces -o json | jq '.items[] | select(any(.status.conditions[]?; .type=="Available" and .status=="True") | not) | {namespace: .metadata.namespace, name: .metadata.name, conditions: .status.conditions}'

2. Describe the problematic deployment

# Replace <namespace> and <deployment-name> with actual values
oc describe deployment <deployment-name> -n <namespace>

Look for:

  • Conditions section (especially Available, Progressing, ReplicaFailure)
  • Events showing errors or warnings
  • Replica status (desired vs ready vs available)
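The replica comparison can also be read directly from the deployment object instead of the describe output. A minimal sketch, with the hypothetical helper name `replica_summary`. Fields that the API omits (for example availableReplicas when no replica is available) are counted as zero.

```shell
# Hypothetical helper: summarize desired, ready and available replicas.
# Usage: replica_summary <deployment-name> <namespace>
replica_summary() {
  oc get deployment "$1" -n "$2" \
    -o jsonpath='{.spec.replicas},{.status.readyReplicas},{.status.availableReplicas}{"\n"}' |
    awk -F, '{ printf "desired=%d ready=%d available=%d missing=%d\n", $1, $2, $3, $1 - $3 }'
}
```

A non-zero `missing` value means fewer replicas are available than the deployment asks for.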

3. Check pod status

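# List pods for the deployment (assumes its pods carry an app=<deployment-label> label;
# check .spec.selector.matchLabels in the deployment if they do not)
oc get pods -n <namespace> -l app=<deployment-label>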

# Describe problematic pods
oc describe pod <pod-name> -n <namespace>
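Not every deployment labels its pods with `app=...`. One way to avoid guessing is to read the selector from the deployment itself. The sketch below takes it from the SELECTOR column of the wide output. The function names are hypothetical, and the approach assumes a plain matchLabels selector, since set-based expressions contain spaces and would not parse this way.

```shell
# Hypothetical helper: print the deployment's label selector, taken from the
# SELECTOR column (the last column) of the wide output.
# Usage: deployment_selector <deployment-name> <namespace>
deployment_selector() {
  oc get deployment "$1" -n "$2" -o wide --no-headers | awk '{ print $NF }'
}

# List the pods matched by that selector.
# Usage: pods_for_deployment <deployment-name> <namespace>
pods_for_deployment() {
  oc get pods -n "$2" -l "$(deployment_selector "$1" "$2")"
}
```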

4. Review pod logs

# Check current pod logs
oc logs <pod-name> -n <namespace>

# Check previous pod logs (if pod crashed)
oc logs <pod-name> -n <namespace> --previous

5. Check events

# View recent events in the namespace
oc get events -n <namespace> --sort-by='.lastTimestamp'
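In a busy namespace the full event list can be noisy. A small sketch that narrows it to Warning events, and optionally to a single object, using field selectors. The helper name `warning_events` is hypothetical.

```shell
# Hypothetical helper: show only Warning events, sorted by last timestamp.
# With a second argument, restrict the output to events about that object.
# Usage: warning_events <namespace> [object-name]
warning_events() {
  ns=$1
  selector="type=Warning"
  if [ -n "${2:-}" ]; then
    selector="$selector,involvedObject.name=$2"
  fi
  oc get events -n "$ns" --field-selector "$selector" --sort-by=.lastTimestamp
}
```

For example, `warning_events <namespace> <pod-name>` shows only the warnings recorded for one pod.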

6. Check resource availability

# Check node resources
oc adm top nodes

# Check pod resource requests/limits
oc describe deployment <deployment-name> -n <namespace> | grep -E -A 5 "Limits|Requests"
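To compare what a pod asks for against what the nodes have free, it can help to total the CPU requests of all containers in the pod template. A minimal sketch, assuming requests are written either in millicores (250m) or in cores (1, 0.5). The helper name is hypothetical.

```shell
# Hypothetical helper: total CPU requested by one pod of the deployment,
# in millicores. Containers without a CPU request count as zero.
# Usage: pod_cpu_request_millicores <deployment-name> <namespace>
pod_cpu_request_millicores() {
  oc get deployment "$1" -n "$2" \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.resources.requests.cpu}{"\n"}{end}' |
    awk '
      /m$/   { sub(/m$/, ""); total += $0; next }  # already millicores, e.g. 250m
      NF > 0 { total += $0 * 1000 }                # cores, e.g. 1 or 0.5
      END    { printf "%dm\n", total }
    '
}
```

Multiplying the result by the desired replica count gives a rough figure to set against the output of `oc adm top nodes`.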

Solution

General troubleshooting steps:

  1. For image pull errors:

    # Verify the image exists and is accessible
    oc describe deployment <deployment-name> -n <namespace> | grep Image
    
    # Check image pull secrets
    oc get secrets -n <namespace>
    
    # Update deployment with correct image or credentials
    oc set image deployment/<deployment-name> <container-name>=<new-image>:<tag> -n <namespace>
  2. For resource constraints:

    # Check resource quotas
    oc describe resourcequota -n <namespace>
    
    # Check limit ranges
    oc describe limitrange -n <namespace>
    
    # Adjust resource requests/limits if needed (the values below are examples only)
    oc set resources deployment/<deployment-name> --limits=cpu=500m,memory=512Mi --requests=cpu=250m,memory=256Mi -n <namespace>
  3. For configuration errors:

    # Verify ConfigMaps exist
    oc get configmaps -n <namespace>
    
    # Verify Secrets exist
    oc get secrets -n <namespace>
    
    # Edit deployment to fix configuration
    oc edit deployment <deployment-name> -n <namespace>
  4. For failed readiness/liveness probes:

    # Check probe configuration
    oc get deployment <deployment-name> -n <namespace> -o yaml | grep -E -A 10 "livenessProbe|readinessProbe"
    
    # Adjust probe timing or endpoints as needed
    oc edit deployment <deployment-name> -n <namespace>
  5. For rollout issues:

    # Check rollout status
    oc rollout status deployment/<deployment-name> -n <namespace>
    
    # Rollback to previous version if needed
    oc rollout undo deployment/<deployment-name> -n <namespace>
    
    # Pause rollout to investigate, then resume it when done
    oc rollout pause deployment/<deployment-name> -n <namespace>
    oc rollout resume deployment/<deployment-name> -n <namespace>
  6. For node scheduling issues:

    # Check node status
    oc get nodes
    
    # Check pod scheduling events
    oc describe pod <pod-name> -n <namespace> | grep -A 10 Events
    
    # Remove the taint if it is blocking scheduling
    # (use with caution: the trailing "-" removes every taint with this key)
    oc adm taint nodes <node-name> <taint-key>-
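For the rollout case, the status check and the rollback can be combined into one step. This is only a sketch under stated assumptions: the helper name `rollout_or_undo` and the 120s default are hypothetical, and an automatic rollback may not be appropriate where the previous revision is also unhealthy.

```shell
# Hypothetical helper: wait for a rollout to finish and roll back if it does
# not complete within the timeout (default 120s).
# Usage: rollout_or_undo <deployment-name> <namespace> [timeout]
rollout_or_undo() {
  if oc rollout status "deployment/$1" -n "$2" --timeout="${3:-120s}"; then
    echo "rollout of $1 completed"
  else
    echo "rollout of $1 did not complete; rolling back" >&2
    oc rollout undo "deployment/$1" -n "$2"
  fi
}
```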

Verify the fix:

# Check deployment status
oc get deployment <deployment-name> -n <namespace>

# Verify all replicas are ready (adjust the label selector to match the deployment's pods)
oc get pods -n <namespace> -l app=<deployment-label>

# Check deployment conditions
oc get deployment <deployment-name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Available")]}'
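Rather than polling these commands by hand, `oc wait` can block until the Available condition becomes True. A minimal sketch, where the helper name `wait_until_available` and the 120s default timeout are assumptions.

```shell
# Hypothetical helper: block until the deployment reports Available=True,
# or fail after the timeout (default 120s).
# Usage: wait_until_available <deployment-name> <namespace> [timeout]
wait_until_available() {
  if oc wait --for=condition=Available "deployment/$1" -n "$2" --timeout="${3:-120s}" >/dev/null; then
    echo "$2/$1 is available"
  else
    echo "$2/$1 is still unavailable" >&2
    return 1
  fi
}
```

The non-zero return code on timeout makes the helper usable in scripts, for example `wait_until_available <deployment-name> <namespace> 300s || exit 1`.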

Resources
