Summary
Pusher pods in prod-omi-backend are hitting cluster capacity limits. The HPA wants to scale beyond 24 replicas but new pods can't be scheduled — causing a cycle of health check failures, pod kills, and restarts.
Evidence (24h window, May 12-13 2026)
Pod lifecycle events:
- Unhealthy events: concentrated entirely on prod-omi-pusher pods. Liveness/readiness probes to /health:8080 are timing out (context deadline exceeded)
- Killing events: ~50% pusher, ~40% backend-listen, ~6% vad
- Zero OOMKill events — not a memory issue
- Event log truncated at query limit (high volume)
FailedScheduling:
- All scheduling failures target prod-omi-pusher pods
- Message: 0/29 nodes available: 11 Insufficient cpu, 11 Insufficient memory, 16 didn't match node affinity, 2 untolerated taints
- 2 pods currently stuck pending: prod-omi-pusher-cb886f56d-njwml and prod-omi-pusher-cb886f56d-x7vc9
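The scheduler message above packs several per-reason node counts into one line. As a minimal sketch (the helper name `parse_failed_scheduling` is hypothetical and not part of any Kubernetes tooling), it can be split into a tally for easier comparison across events:

```python
import re


def parse_failed_scheduling(message: str) -> tuple[int, int, dict[str, int]]:
    """Split a FailedScheduling message into (available, total, {reason: node_count})."""
    head, _, tail = message.partition(":")
    match = re.match(r"\s*(\d+)/(\d+) nodes", head)
    if match is None:
        raise ValueError(f"unrecognized message: {message!r}")
    available, total = int(match.group(1)), int(match.group(2))

    reasons: dict[str, int] = {}
    for part in tail.split(","):
        # Each part looks like "11 Insufficient cpu": a count, then the reason text.
        count, _, reason = part.strip().partition(" ")
        reasons[reason] = int(count)
    return available, total, reasons


# Message copied from the event above.
MESSAGE = (
    "0/29 nodes available: 11 Insufficient cpu, 11 Insufficient memory, "
    "16 didn't match node affinity, 2 untolerated taints"
)

available, total, reasons = parse_failed_scheduling(MESSAGE)
for reason, count in reasons.items():
    print(f"{count:>3} nodes: {reason}")
print(f"sum of per-reason counts: {sum(reasons.values())} (cluster has {total} nodes)")
```

The per-reason counts add up to more than the 29 nodes in the cluster, which is consistent with a single node being counted under more than one reason (for example, short on both CPU and memory).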
Current state:
- 24/24 pusher pods Running (but at capacity)
- HPA metric at ~97% of scale target (29041m/30), wants to scale to 26+
- 12 nodes in pusher-pool-v3, all Ready
- Node CPU usage: 4-21% actual, but resource requests fill the scheduler budget
Resource requests per pusher pod:
    requests:
      cpu: 700m
      memory: 4608Mi
    limits:
      cpu: 700m
      memory: 4608Mi
24 pods × 700m = 16.8 cores requested. 12 nodes can't fit 2 more pods at 700m each.
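The request arithmetic above can be reproduced, and extended to memory, with a short sketch. The pod count, node count, and per-pod requests are taken from this report; the per-node average assumes pods are spread evenly across the pool, which is an assumption rather than something the events confirm.

```python
PODS = 24              # running pusher pods
NODES = 12             # nodes in pusher-pool-v3
CPU_REQUEST_M = 700    # millicores requested per pod
MEM_REQUEST_MI = 4608  # MiB requested per pod

total_cpu_cores = PODS * CPU_REQUEST_M / 1000
total_mem_gib = PODS * MEM_REQUEST_MI / 1024
pods_per_node = PODS / NODES  # average, assuming an even spread (assumption)

print(f"CPU requested:    {total_cpu_cores} cores")
print(f"Memory requested: {total_mem_gib} GiB")
print(f"Average pods per node: {pods_per_node}")
```

Memory is included because the FailedScheduling message reports both Insufficient cpu and Insufficient memory, so either resource may be the binding constraint.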
Correlated failures:
- "Upgrade Backend Listen Helm Chart" GitHub Action failed at "Verify rollout" step (May 12 20:32 UTC) — likely because rolling update couldn't schedule temporary extra pods
- No node maintenance events — instability is internal, not GCP-initiated
Root Cause
The pusher node pool (pusher-pool-v3) has run out of schedulable capacity. Each pod requests 700m CPU and 4608Mi (4.5Gi) memory, with requests equal to limits (no burst). The 12-node pool can host ~24 pods, and scaling beyond that is blocked.
When pods are overloaded, their health checks time out and Kubernetes kills them. The replacement pods then compete for the same exhausted capacity and cannot always be scheduled, which creates a churn cycle.
Suggested Fix
Option A: Add 2-4 nodes to pusher-pool-v3 (quickest)
Option B: Increase node machine type (more headroom per node)
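Option C: Reduce resource requests. Observed node CPU usage is 4-21%, well below what the 700m requests reserve. Lowering the CPU request to 400m while keeping the 700m limit cuts the per-pod CPU reservation by ~43%, so the same CPU budget could hold up to ~75% more pods (700/400 = 1.75). The scheduler also reports Insufficient memory, so the 4608Mi memory request would likely need to be reduced as well before the extra pods fit.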
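A rough way to sanity-check Option C is to treat the pool's schedulable budget as exactly what the 24 running pods consume today. That is an assumption: the real allocatable CPU and memory per node are not stated in this report, and the helper `max_pods` is hypothetical.

```python
def max_pods(cpu_budget_m: int, mem_budget_mi: int,
             cpu_request_m: int, mem_request_mi: int) -> int:
    """Pods that fit when both the CPU and the memory request must be satisfied."""
    return min(cpu_budget_m // cpu_request_m, mem_budget_mi // mem_request_mi)


# Assumption: the pool is exactly full at 24 pods with the current requests.
CPU_BUDGET_M = 24 * 700     # millicores
MEM_BUDGET_MI = 24 * 4608   # MiB

current = max_pods(CPU_BUDGET_M, MEM_BUDGET_MI, 700, 4608)
cpu_only_at_400m = CPU_BUDGET_M // 400
option_c = max_pods(CPU_BUDGET_M, MEM_BUDGET_MI, 400, 4608)

print(f"current requests:              {current} pods")
print(f"400m CPU, ignoring memory:     {cpu_only_at_400m} pods")
print(f"400m CPU, memory still 4608Mi: {option_c} pods")
```

Under this assumption the CPU-only figure rises from 24 to 42 pods, but the combined figure stays at 24 until the memory request is lowered or the nodes turn out to have spare allocatable memory.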
Impact
- Pod instability degrades WebSocket connections for wearable users
- Helm chart upgrades can fail if rollout needs headroom
- HPA can't respond to load spikes