K8s ‐ Check deployment replicas status

K8s - Verify deployment replica counts

Description

This rule validates that all deployments across all namespaces have the correct number of replicas in a ready and available state. It ensures that the actual replica counts match the desired replica counts specified in the deployment configuration.

The rule examines three key replica metrics for each deployment:

Ready replicas: Pods that have passed readiness probes and are ready to serve traffic
Available replicas: Pods that have been ready for at least minReadySeconds (0 by default)
Updated replicas: Pods that are running the latest template version (indicates rollout completion)

Deployments with mismatched replica counts are flagged as problematic, indicating potential scaling, rollout, or availability issues.
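
As an illustration of the comparison described above, the following minimal sketch reports which of the three counters differ from the desired count. The function name replica_mismatches is hypothetical and not part of oc. It assumes that counters the API omits (which happens when they are zero) are passed as empty or 0.

```shell
# replica_mismatches DESIRED [READY] [AVAILABLE] [UPDATED]
# Prints "ok" when all three counters equal DESIRED, otherwise
# "mismatch:" followed by the names of the counters that differ.
replica_mismatches() {
  desired=$1
  ready=${2:-0}        # empty or missing counters are treated as 0
  available=${3:-0}
  updated=${4:-0}
  out=""
  [ "$ready" -eq "$desired" ]     || out="$out ready"
  [ "$available" -eq "$desired" ] || out="$out available"
  [ "$updated" -eq "$desired" ]   || out="$out updated"
  if [ -z "$out" ]; then
    echo "ok"
  else
    echo "mismatch:$out"
  fi
}
```

For example, replica_mismatches 3 2 2 3 prints "mismatch: ready available", which points at pods that exist on the latest template but are not passing readiness.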

Prerequisites

Access to the OpenShift cluster with permissions to list deployments across all namespaces
The oc command-line tool configured and authenticated

Impact

Deployments with incorrect replica counts can lead to:

Reduced capacity: Fewer replicas than desired results in reduced application capacity
Poor performance: Insufficient replicas to handle traffic load
Single point of failure: Missing replicas eliminate redundancy and high availability
Incomplete rollouts: Stuck deployments prevent new code or configuration from being deployed
Traffic overload: Remaining replicas may be overwhelmed by traffic meant for all replicas
SLA violations: Reduced capacity may breach service level agreements
Failed auto-scaling: HPA (Horizontal Pod Autoscaler) may not function correctly

Critical applications running with fewer replicas than expected are at risk of outages if remaining pods fail.

Root Cause

Common scenarios that may cause replica count mismatches include:

Resource constraints
- Insufficient CPU or memory on nodes to schedule new pods
- Resource quotas preventing pod creation
- Limit ranges blocking pod scheduling
Pod failures
- Pods failing readiness probes
- Containers crashing during startup
- Init containers failing to complete
- Application errors preventing successful pod initialization
Image issues
- Image pull failures due to authentication errors
- Non-existent or deleted container images
- Registry unavailability or network issues
- Wrong image tags or corrupted images
Scheduling problems
- No nodes available with required resources
- Node selectors or affinity rules preventing placement
- Taints on nodes without matching tolerations
- Pod anti-affinity rules limiting placement
Storage issues
- PersistentVolumeClaim binding failures
- Insufficient storage capacity
- Volume mount configuration errors
- StorageClass provisioning failures
Configuration errors
- Missing ConfigMaps or Secrets
- Invalid environment variables
- Incorrect volume configurations
- Security context violations
Rollout issues
- Deployment update stuck in progress
- RollingUpdate strategy blocked by PodDisruptionBudgets
- Insufficient surge capacity during update
- MaxUnavailable settings preventing rollout
Cluster capacity
- Cluster at full capacity
- Node failures reducing available capacity
- Node maintenance or cordoning
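
The surge and MaxUnavailable items under Rollout issues come down to arithmetic. Kubernetes resolves percentage values against the replica count, rounding maxSurge up and maxUnavailable down; both default to 25%. The sketch below reproduces that calculation for percentage inputs only. The function name rollout_bounds is hypothetical.

```shell
# rollout_bounds REPLICAS SURGE_PERCENT UNAVAILABLE_PERCENT
# Prints "<max total pods during rollout> <min available pods during rollout>".
rollout_bounds() {
  replicas=$1
  surge_pct=$2
  unavail_pct=$3
  surge=$(( (replicas * surge_pct + 99) / 100 ))   # maxSurge rounds up
  unavail=$(( replicas * unavail_pct / 100 ))      # maxUnavailable rounds down
  # If both resolve to 0, Kubernetes falls back to maxUnavailable = 1
  # so that the rollout can still make progress.
  if [ "$surge" -eq 0 ] && [ "$unavail" -eq 0 ]; then
    unavail=1
  fi
  echo "$(( replicas + surge )) $(( replicas - unavail ))"
}
```

For 3 replicas with the defaults, rollout_bounds 3 25 25 prints "4 3": the rollout needs room for a fourth pod and may not take an existing pod down first, so an exhausted quota or a full cluster stalls it.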

Diagnostics

1. Identify deployments with replica issues

# List all deployments with replica counts
oc get deployments --all-namespaces -o wide

# Find deployments where ready != desired
# (status counters are omitted from the JSON when they are zero, so default them to 0)
oc get deployments --all-namespaces -o json | jq '.items[] | select((.status.readyReplicas // 0) != .spec.replicas) | {namespace: .metadata.namespace, name: .metadata.name, desired: .spec.replicas, ready: (.status.readyReplicas // 0), available: (.status.availableReplicas // 0), updated: (.status.updatedReplicas // 0)}'

2. Get detailed deployment information

# Replace <namespace> and <deployment-name> with actual values
oc describe deployment <deployment-name> -n <namespace>

Look for:

Replicas section showing desired/current/ready/available counts
Conditions section (especially Progressing, Available)
Events showing scheduling or scaling issues
ReplicaSet status
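
The Conditions section can also be read as type, status, and reason triples. The sketch below maps the combinations commonly seen on a Deployment to a short interpretation. The function name explain_condition is hypothetical, and the list of reasons is not exhaustive; treat anything unrecognized as a prompt to read the condition message.

```shell
# explain_condition TYPE STATUS REASON
# Input can be produced with, for example:
#   oc get deployment <deployment-name> -n <namespace> \
#     -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{" "}{.reason}{"\n"}{end}'
explain_condition() {
  case "$1/$2/$3" in
    Available/True/*)
      echo "minimum availability met" ;;
    Available/False/*)
      echo "fewer pods available than the deployment requires" ;;
    Progressing/True/NewReplicaSetAvailable)
      echo "rollout complete" ;;
    Progressing/True/*)
      echo "rollout in progress" ;;
    Progressing/False/ProgressDeadlineExceeded)
      echo "rollout stalled past progressDeadlineSeconds" ;;
    Progressing/Unknown/DeploymentPaused)
      echo "rollout paused" ;;
    ReplicaFailure/True/*)
      echo "pod creation is failing: check quotas and limit ranges" ;;
    *)
      echo "unrecognized condition: read the condition message" ;;
  esac
}
```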

3. Check associated ReplicaSets

# List ReplicaSets for the deployment
oc get replicasets -n <namespace> -l app=<deployment-label>

# Describe the current ReplicaSet
oc describe replicaset <replicaset-name> -n <namespace>

4. Examine pod status

# List all pods for the deployment
oc get pods -n <namespace> -l app=<deployment-label> -o wide

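# Check for pods not in the Running phase
# (crash-looping pods still report Running, so also check READY and RESTARTS in the listing above)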
oc get pods -n <namespace> -l app=<deployment-label> --field-selector=status.phase!=Running

5. Investigate pod failures

# Describe problematic pods
oc describe pod <pod-name> -n <namespace>

# Check pod logs
oc logs <pod-name> -n <namespace>

# Check previous logs if container crashed
oc logs <pod-name> -n <namespace> --previous
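
The waiting reason reported for a container usually points at one of the categories listed under Root Cause. The following sketch shows one possible mapping. The function name root_cause_for_reason is hypothetical, and the set of reasons covered is an assumption about what is most commonly seen rather than a complete list.

```shell
# root_cause_for_reason REASON
# REASON can be read with, for example:
#   oc get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'
root_cause_for_reason() {
  case "$1" in
    ImagePullBackOff|ErrImagePull|InvalidImageName)
      echo "Image issues" ;;
    CrashLoopBackOff|RunContainerError)
      echo "Pod failures" ;;
    CreateContainerConfigError)
      echo "Configuration errors" ;;
    ContainerCreating|PodInitializing)
      echo "Still starting: check events for volume or init container problems" ;;
    *)
      echo "Unmapped reason: read the pod events" ;;
  esac
}
```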

6. Check cluster resource availability

# Check node resources
oc adm top nodes

# Check pod resource usage
oc adm top pods -n <namespace>

# Check resource quotas
oc describe resourcequota -n <namespace>

7. Check for scheduling issues

# Get events for the namespace
oc get events -n <namespace> --sort-by='.lastTimestamp' | grep -i "fail\|error\|warn"

# Check if pods are pending
oc get pods -n <namespace> --field-selector=status.phase=Pending

# Describe pending pods to see scheduling errors
oc describe pod <pending-pod-name> -n <namespace> | grep -A 10 Events
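
A FailedScheduling message often names several causes at once, one per group of nodes. The sketch below matches substrings that the scheduler is known to use and prints every Root Cause category that applies. The function name scheduling_hints is hypothetical, and the exact message wording varies between Kubernetes versions, so the patterns are a best-effort assumption.

```shell
# scheduling_hints "MESSAGE"
# MESSAGE can be read with, for example:
#   oc get events -n <namespace> --field-selector reason=FailedScheduling \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}'
scheduling_hints() {
  msg=$1
  found=0
  case "$msg" in *Insufficient*)
    echo "Resource constraints: requested CPU or memory does not fit"; found=1 ;;
  esac
  case "$msg" in *taint*)
    echo "Scheduling problems: taint without matching toleration"; found=1 ;;
  esac
  case "$msg" in *"node affinity"*|*"node selector"*)
    echo "Scheduling problems: node selector or node affinity excludes nodes"; found=1 ;;
  esac
  case "$msg" in *anti-affinity*)
    echo "Scheduling problems: pod anti-affinity limits placement"; found=1 ;;
  esac
  case "$msg" in *PersistentVolumeClaim*)
    echo "Storage issues: PersistentVolumeClaim not bound"; found=1 ;;
  esac
  case "$msg" in *unschedulable*)
    echo "Cluster capacity: node cordoned or unschedulable"; found=1 ;;
  esac
  [ "$found" -eq 1 ] || echo "No known pattern: read the full event message"
}
```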

8. Check rollout status

# Check deployment rollout status
oc rollout status deployment/<deployment-name> -n <namespace>

# View rollout history
oc rollout history deployment/<deployment-name> -n <namespace>

Solution

General troubleshooting steps:

For resource constraint issues:

# Check what resources are requested
oc get deployment <deployment-name> -n <namespace> -o yaml | grep -A 10 resources

# Check namespace resource quota
oc describe resourcequota -n <namespace>

# If quota is exhausted, either increase quota or reduce resource requests
oc edit resourcequota <quota-name> -n <namespace>

# Or adjust deployment resource requests
oc set resources deployment/<deployment-name> --limits=cpu=500m,memory=512Mi --requests=cpu=250m,memory=256Mi -n <namespace>

For pod readiness probe failures:

# Check readiness probe configuration
oc get deployment <deployment-name> -n <namespace> -o yaml | grep -A 10 readinessProbe

# Adjust probe timing if needed (for example, increase initialDelaySeconds, timeoutSeconds, or failureThreshold)
oc edit deployment <deployment-name> -n <namespace>

# Example: increase initialDelaySeconds from 10 to 30
# Under readinessProbe, change: initialDelaySeconds: 30

For image pull errors:

# Check the image being used
oc get deployment <deployment-name> -n <namespace> -o jsonpath='{.spec.template.spec.containers[*].image}'

# Verify image pull secrets exist
oc get secrets -n <namespace> | grep docker

# Update to correct image
oc set image deployment/<deployment-name> <container-name>=<image>:<tag> -n <namespace>

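# If authentication is needed, create image pull secret
oc create secret docker-registry <secret-name> \
  --docker-server=<registry> \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>

# Link the secret to the service account the pods run as (default is assumed here)
oc secrets link default <secret-name> --for=pull -n <namespace>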

For scheduling issues:

# Check if nodes are available
oc get nodes

# Check node capacity
oc describe nodes | grep -A 5 "Allocated resources"

# Remove node taints if blocking (use with caution)
oc adm taint nodes <node-name> <taint-key>-

# Adjust node selector or affinity rules if too restrictive
oc edit deployment <deployment-name> -n <namespace>

For storage/PVC issues:

# Check PVC status
oc get pvc -n <namespace>

# Describe problematic PVC
oc describe pvc <pvc-name> -n <namespace>

# Check if StorageClass exists and is available
oc get storageclass

# Check available persistent volumes
oc get pv

For stuck rollouts:

# Check rollout status
oc rollout status deployment/<deployment-name> -n <namespace> --timeout=60s

# If stuck, try restarting the rollout
oc rollout restart deployment/<deployment-name> -n <namespace>

# If that fails, rollback to previous version
oc rollout undo deployment/<deployment-name> -n <namespace>

# Resume paused rollout
oc rollout resume deployment/<deployment-name> -n <namespace>

For configuration errors:

# Verify ConfigMaps referenced by the deployment exist
oc get configmaps -n <namespace>

# Verify Secrets exist
oc get secrets -n <namespace>

# Check deployment for configuration references
oc get deployment <deployment-name> -n <namespace> -o yaml | grep -A 5 "configMap\|secretRef"

Manual scaling to match desired state:

# Set the desired replica count (this changes .spec.replicas; it does not repair pods that fail to start, and an HPA may override it)
oc scale deployment <deployment-name> --replicas=<desired-count> -n <namespace>

# Wait and verify
oc get deployment <deployment-name> -n <namespace> -w

Force recreation of problematic pods:

# Delete specific failing pods to force recreation
oc delete pod <pod-name> -n <namespace>

# Or delete all pods for the deployment (they will be recreated; expect a brief outage while they restart)
oc delete pods -n <namespace> -l app=<deployment-label>

Verify the fix:

# Check deployment replica status
oc get deployment <deployment-name> -n <namespace>

# Verify that both numbers in the READY column match
# Example: READY 3/3 means 3 ready out of 3 desired

# Check all pods are running
oc get pods -n <namespace> -l app=<deployment-label>

# Verify replica counts in detail
oc get deployment <deployment-name> -n <namespace> -o jsonpath='{.spec.replicas}{" desired, "}{.status.readyReplicas}{" ready, "}{.status.availableReplicas}{" available, "}{.status.updatedReplicas}{" updated\n"}'

Resources

OpenShift Deployments Documentation
