K8s ‐ Check deployment replicas status
This rule validates that all deployments across all namespaces have the correct number of replicas in a ready and available state. It ensures that the actual replica counts match the desired replica counts specified in the deployment configuration.
The rule examines three key replica metrics for each deployment:
- Ready replicas: Pods that have passed readiness probes and are ready to serve traffic
- Available replicas: Pods that have been ready for at least minReadySeconds (0 by default)
- Updated replicas: Pods that are running the latest template version (indicates rollout completion)
Deployments with mismatched replica counts are flagged as problematic, indicating potential scaling, rollout, or availability issues.
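The comparison described above can be sketched as a small shell function. This is only an illustration of the logic, not the rule's actual implementation; the function name `replica_state` and its output format are hypothetical. Empty arguments are treated as 0, since Kubernetes omits zero-valued counters from the deployment status.

```shell
#!/usr/bin/env bash
# replica_state DESIRED READY AVAILABLE UPDATED   (hypothetical helper)
# Prints "ok" when all three status counters equal the desired count,
# otherwise "mismatch:" followed by the names of the counters that differ.
replica_state() {
  local desired=${1:-0} ready=${2:-0} available=${3:-0} updated=${4:-0}
  local problems=""
  [ "$ready" -eq "$desired" ]     || problems="$problems ready"
  [ "$available" -eq "$desired" ] || problems="$problems available"
  [ "$updated" -eq "$desired" ]   || problems="$problems updated"
  if [ -z "$problems" ]; then
    echo "ok"
  else
    echo "mismatch:${problems}"
  fi
}
```

For example, `replica_state 3 2 2 3` reports that the ready and available counters lag behind the desired count, which would point at pod health rather than at an incomplete rollout.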
- Access to the OpenShift cluster with permissions to list deployments across all namespaces
- The oc command-line tool, configured and authenticated
Deployments with incorrect replica counts can lead to:
- Reduced capacity: Fewer replicas than desired results in reduced application capacity
- Poor performance: Insufficient replicas to handle traffic load
- Single point of failure: Missing replicas eliminate redundancy and high availability
- Incomplete rollouts: Stuck deployments prevent new code or configuration from being deployed
- Traffic overload: Remaining replicas may be overwhelmed by traffic meant for all replicas
- SLA violations: Reduced capacity may breach service level agreements
- Failed auto-scaling: HPA (Horizontal Pod Autoscaler) may not function correctly
Critical applications running with fewer replicas than expected are at risk of outages if remaining pods fail.
Common scenarios that may cause replica count mismatches include:
- Resource constraints
  - Insufficient CPU or memory on nodes to schedule new pods
  - Resource quotas preventing pod creation
  - Limit ranges rejecting pods whose requests or limits fall outside the allowed bounds
- Pod failures
  - Pods failing readiness probes
  - Containers crashing during startup
  - Init containers failing to complete
  - Application errors preventing successful pod initialization
- Image issues
  - Image pull failures due to authentication errors
  - Non-existent or deleted container images
  - Registry unavailability or network issues
  - Wrong image tags or corrupted images
- Scheduling problems
  - No nodes available with required resources
  - Node selectors or affinity rules preventing placement
  - Taints on nodes without matching tolerations
  - Pod anti-affinity rules limiting placement
- Storage issues
  - PersistentVolumeClaim binding failures
  - Insufficient storage capacity
  - Volume mount configuration errors
  - StorageClass provisioning failures
- Configuration errors
  - Missing ConfigMaps or Secrets
  - Invalid environment variables
  - Incorrect volume configurations
  - Security context violations
- Rollout issues
  - Deployment update stuck in progress
  - RollingUpdate strategy blocked by PodDisruptionBudgets
  - Insufficient surge capacity during update
  - MaxUnavailable settings preventing rollout
- Cluster capacity
  - Cluster at full capacity
  - Node failures reducing available capacity
  - Node maintenance or cordoning
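As a rough first-pass triage, the reason strings that Kubernetes reports on pods and events can be mapped onto the categories above. The sketch below is an assumption-laden illustration: the function name `cause_category` is hypothetical, the reason list is not exhaustive, and some reasons are ambiguous (for example, FailedCreate often indicates an exhausted quota but can also come from an admission or security context rejection, and FailedScheduling can stem from either resource shortage or placement rules).

```shell
#!/usr/bin/env bash
# cause_category REASON   (hypothetical helper)
# Maps a pod status reason or event reason to the most likely cause category.
# Treat the result as a starting point for investigation, not a diagnosis.
cause_category() {
  case "$1" in
    ErrImagePull|ImagePullBackOff|InvalidImageName)    echo "Image issues" ;;
    CrashLoopBackOff|Unhealthy)                        echo "Pod failures" ;;
    CreateContainerConfigError)                        echo "Configuration errors" ;;
    FailedScheduling|Unschedulable)                    echo "Scheduling problems" ;;
    FailedMount|FailedAttachVolume|ProvisioningFailed) echo "Storage issues" ;;
    FailedCreate)                                      echo "Resource constraints" ;;
    ProgressDeadlineExceeded)                          echo "Rollout issues" ;;
    *)                                                 echo "Unknown" ;;
  esac
}

# Possible usage: count event reasons in a namespace, then classify each one.
#   oc get events -n <namespace> -o custom-columns=REASON:.reason --no-headers | sort | uniq -c
#   cause_category ImagePullBackOff
```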
# List all deployments with replica counts
oc get deployments --all-namespaces -o wide
# Find deployments where ready != desired
# (status counters are omitted when they are 0, so default them to 0 before comparing)
oc get deployments --all-namespaces -o json | jq '.items[] | select((.status.readyReplicas // 0) != .spec.replicas) | {namespace: .metadata.namespace, name: .metadata.name, desired: .spec.replicas, ready: (.status.readyReplicas // 0), available: (.status.availableReplicas // 0), updated: (.status.updatedReplicas // 0)}'

# Replace <namespace> and <deployment-name> with actual values
oc describe deployment <deployment-name> -n <namespace>

Look for:
- Replicas section showing desired/current/ready/available counts
- Conditions section (especially Progressing, Available)
- Events showing scheduling or scaling issues
- ReplicaSet status
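The Conditions check in the list above can also be done without reading the full describe output. The following is a minimal sketch, assuming the standard Deployment condition types (Available, Progressing, ReplicaFailure); the function name `flag_conditions` and its messages are hypothetical.

```shell
#!/usr/bin/env bash
# flag_conditions   (hypothetical helper)
# Reads "<type> <status> <reason>" lines on stdin and prints one line for
# each condition that suggests a problem. Prints nothing when all is well.
flag_conditions() {
  awk '
    $1 == "Available"      && $2 != "True" { print "not available: " $3 }
    $1 == "Progressing"    && $2 != "True" { print "rollout stalled: " $3 }
    $1 == "ReplicaFailure" && $2 == "True" { print "pod creation failing: " $3 }
  '
}

# Possible usage against a live cluster:
#   oc get deployment <deployment-name> -n <namespace> \
#     -o jsonpath='{range .status.conditions[*]}{.type} {.status} {.reason}{"\n"}{end}' \
#     | flag_conditions
```

A Progressing condition with status False and reason ProgressDeadlineExceeded is the usual sign of a stuck rollout, while Available set to False means fewer pods are available than the deployment's minimum.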
# List ReplicaSets for the deployment
oc get replicasets -n <namespace> -l app=<deployment-label>
# Describe the current ReplicaSet
oc describe replicaset <replicaset-name> -n <namespace>

# List all pods for the deployment
oc get pods -n <namespace> -l app=<deployment-label> -o wide
# Check for pending or failed pods
oc get pods -n <namespace> -l app=<deployment-label> --field-selector=status.phase!=Running

# Describe problematic pods
oc describe pod <pod-name> -n <namespace>
# Check pod logs
oc logs <pod-name> -n <namespace>
# Check previous logs if container crashed
oc logs <pod-name> -n <namespace> --previous

# Check node resources
oc adm top nodes
# Check pod resource usage
oc adm top pods -n <namespace>
# Check resource quotas
oc describe resourcequota -n <namespace>

# Get events for the namespace
oc get events -n <namespace> --sort-by='.lastTimestamp' | grep -i "fail\|error\|warn"
# Check if pods are pending
oc get pods -n <namespace> --field-selector=status.phase=Pending
# Describe pending pods to see scheduling errors
oc describe pod <pending-pod-name> -n <namespace> | grep -A 10 Events

# Check deployment rollout status
oc rollout status deployment/<deployment-name> -n <namespace>
# View rollout history
oc rollout history deployment/<deployment-name> -n <namespace>

- For resource constraint issues:
  # Check what resources are requested
  oc get deployment <deployment-name> -n <namespace> -o yaml | grep -A 10 resources
  # Check namespace resource quota
  oc describe resourcequota -n <namespace>
  # If quota is exhausted, either increase quota or reduce resource requests
  oc edit resourcequota <quota-name> -n <namespace>
  # Or adjust deployment resource requests
  oc set resources deployment/<deployment-name> --limits=cpu=500m,memory=512Mi --requests=cpu=250m,memory=256Mi -n <namespace>
- For pod readiness probe failures:
  # Check readiness probe configuration
  oc get deployment <deployment-name> -n <namespace> -o yaml | grep -A 10 readinessProbe
  # Adjust probe timing if needed (increase initialDelaySeconds or periodSeconds)
  oc edit deployment <deployment-name> -n <namespace>
  # Example: increase initialDelaySeconds from 10 to 30
  # Under readinessProbe, change:
  #   initialDelaySeconds: 30
- For image pull errors:
  # Check the image being used
  oc get deployment <deployment-name> -n <namespace> -o jsonpath='{.spec.template.spec.containers[*].image}'
  # Verify image pull secrets exist
  oc get secrets -n <namespace> | grep docker
  # Update to correct image
  oc set image deployment/<deployment-name> <container-name>=<image>:<tag> -n <namespace>
  # If authentication is needed, create image pull secret
  oc create secret docker-registry <secret-name> \
    --docker-server=<registry> \
    --docker-username=<username> \
    --docker-password=<password> \
    -n <namespace>
  # Link the secret to the service account that runs the pods (default shown here)
  oc secrets link default <secret-name> --for=pull -n <namespace>
- For scheduling issues:
  # Check if nodes are available
  oc get nodes
  # Check node capacity
  oc describe nodes | grep -A 5 "Allocated resources"
  # Remove node taints if blocking (use with caution)
  oc adm taint nodes <node-name> <taint-key>-
  # Adjust node selector or affinity rules if too restrictive
  oc edit deployment <deployment-name> -n <namespace>
- For storage/PVC issues:
  # Check PVC status
  oc get pvc -n <namespace>
  # Describe problematic PVC
  oc describe pvc <pvc-name> -n <namespace>
  # Check if StorageClass exists and is available
  oc get storageclass
  # Check available persistent volumes
  oc get pv
- For stuck rollouts:
  # Check rollout status
  oc rollout status deployment/<deployment-name> -n <namespace> --timeout=60s
  # If stuck, try restarting the rollout
  oc rollout restart deployment/<deployment-name> -n <namespace>
  # If that fails, rollback to previous version
  oc rollout undo deployment/<deployment-name> -n <namespace>
  # Resume paused rollout
  oc rollout resume deployment/<deployment-name> -n <namespace>
- For configuration errors:
  # Verify ConfigMaps referenced by the deployment exist
  oc get configmaps -n <namespace>
  # Verify Secrets exist
  oc get secrets -n <namespace>
  # Check deployment for configuration references (also matches secretKeyRef and secretName)
  oc get deployment <deployment-name> -n <namespace> -o yaml | grep -i -A 5 "configMap\|secret"
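The ConfigMap check in the last item above can be partly automated by comparing the names a deployment references with the names that exist in the namespace. This is a minimal sketch: `missing_names` is a hypothetical helper, and the jsonpath expressions in the usage comment are assumptions that cover only volume and envFrom references, not individual env entries that use configMapKeyRef.

```shell
#!/usr/bin/env bash
# missing_names REFERENCED EXISTING   (hypothetical helper)
# Both arguments are whitespace-separated lists of names.
# Prints each referenced name that is absent from the existing list, one per line.
missing_names() {
  local ref
  for ref in $1; do
    # $(echo $2) collapses newlines and repeated spaces into single spaces
    case " $(echo $2) " in
      *" $ref "*) ;;          # name exists, nothing to report
      *) echo "$ref" ;;       # name is referenced but not found
    esac
  done
}

# Possible usage (the jsonpath expressions are assumptions; adjust to your manifests):
#   referenced=$(oc get deployment <deployment-name> -n <namespace> \
#     -o jsonpath='{.spec.template.spec.volumes[*].configMap.name} {.spec.template.spec.containers[*].envFrom[*].configMapRef.name}')
#   existing=$(oc get configmaps -n <namespace> -o jsonpath='{.items[*].metadata.name}')
#   missing_names "$referenced" "$existing"
```

The same helper works for Secrets if the two lists are built from secret references and `oc get secrets` instead.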
# Scale deployment to desired replica count
oc scale deployment <deployment-name> --replicas=<desired-count> -n <namespace>
# Wait and verify
oc get deployment <deployment-name> -n <namespace> -w

# Delete specific failing pods to force recreation
oc delete pod <pod-name> -n <namespace>
# Or delete all pods for the deployment (they will be recreated)
oc delete pods -n <namespace> -l app=<deployment-label>

# Check deployment replica status
oc get deployment <deployment-name> -n <namespace>
# Verify that the READY column shows every desired replica as ready
# Example: READY 3/3 means 3 ready out of 3 desired
# Check all pods are running
oc get pods -n <namespace> -l app=<deployment-label>
# Verify replica counts in detail
oc get deployment <deployment-name> -n <namespace> -o jsonpath='{.spec.replicas}{" desired, "}{.status.readyReplicas}{" ready, "}{.status.availableReplicas}{" available, "}{.status.updatedReplicas}{" updated\n"}'
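To confirm that no deployment remains mismatched anywhere in the cluster, the READY column can be checked in bulk. The sketch below assumes the column layout printed by recent oc versions with --all-namespaces (NAMESPACE, NAME, READY as ready/desired, and so on); older clients printed separate DESIRED and CURRENT columns and would need a different filter. The function name `unready_deployments` is hypothetical.

```shell
#!/usr/bin/env bash
# unready_deployments   (hypothetical helper)
# Reads headerless `oc get deployments --all-namespaces` output on stdin and
# prints the deployments whose READY column (ready/desired) does not match.
unready_deployments() {
  awk '{
    split($3, r, "/")                       # r[1] = ready, r[2] = desired
    if (r[1] != r[2])
      print $1 "/" $2, "ready=" r[1], "desired=" r[2]
  }'
}

# Possible usage:
#   oc get deployments --all-namespaces --no-headers | unready_deployments
```

Empty output means every deployment reports all desired replicas as ready. This checks only the ready count, so the jsonpath query above is still the way to compare the available and updated counters for a single deployment.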