GKE pusher pool at capacity: pods can't scale, health checks failing

## Summary

Pusher pods in `prod-omi-backend` are hitting cluster capacity limits. The HPA wants to scale beyond 24 replicas but new pods can't be scheduled — causing a cycle of health check failures, pod kills, and restarts.

## Evidence (24h window, May 12-13 2026)

**Pod lifecycle events:**
- Unhealthy events: concentrated entirely on `prod-omi-pusher` pods — liveness/readiness probes to `/health:8080` timing out (`context deadline exceeded`)
- Killing events: ~50% pusher, ~40% backend-listen, ~6% vad
- Zero OOMKill events — not a memory issue
- Event log truncated at query limit (high volume)

**FailedScheduling:**
- All scheduling failures target `prod-omi-pusher` pods
- Message: `0/29 nodes available: 11 Insufficient cpu, 11 Insufficient memory, 16 didn't match node affinity, 2 untolerated taints`
- 2 pods currently stuck pending: `prod-omi-pusher-cb886f56d-njwml` and `prod-omi-pusher-cb886f56d-x7vc9`
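
To make the scheduler message above easier to read, here is a minimal sketch that splits it into per-reason node counts. The function name `parse_failed_scheduling` is hypothetical, and the parsing assumes the message keeps the `N/M nodes available: <count> <reason>, ...` shape shown in this issue.

```python
import re


def parse_failed_scheduling(message):
    """Split an 'N/M nodes available: ...' message into (total_nodes, reason counts)."""
    head, _, tail = message.partition(": ")
    match = re.match(r"(\d+)/(\d+) nodes", head)
    if match is None:
        raise ValueError(f"unrecognized message: {message!r}")
    total_nodes = int(match.group(2))
    reasons = {}
    for part in tail.rstrip(".").split(", "):
        # Each part looks like "11 Insufficient cpu": a count, then the reason text.
        count, reason = part.split(" ", 1)
        reasons[reason] = int(count)
    return total_nodes, reasons


# Message copied from the FailedScheduling events in this issue.
MESSAGE = (
    "0/29 nodes available: 11 Insufficient cpu, 11 Insufficient memory, "
    "16 didn't match node affinity, 2 untolerated taints"
)

total_nodes, reasons = parse_failed_scheduling(MESSAGE)
for reason, count in sorted(reasons.items(), key=lambda kv: -kv[1]):
    print(f"{count:>3} / {total_nodes}  {reason}")
```

The counts add up to 40 across 29 nodes, so at least some nodes are counted under more than one reason. One reading that fits the numbers (an assumption, not confirmed by the events) is that the same 11 nodes are short on both CPU and memory, since 11 + 16 + 2 = 29.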

**Current state:**
- 24/24 pusher pods Running (but at capacity)
- HPA metric at ~97% of scale target (29041m/30), wants to scale to 26+
- 12 nodes in `pusher-pool-v3`, all Ready
- Node CPU usage: 4-21% actual, but **resource requests** fill the scheduler budget

**Resource requests per pusher pod:**
```
requests:
  cpu: 700m
  memory: 4608Mi
limits:
  cpu: 700m
  memory: 4608Mi
```

24 pods × 700m = 16.8 cores requested. 12 nodes can't fit 2 more pods at 700m each.
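
The same arithmetic can be extended to memory and to the 26 replicas the HPA is asking for. This is a small sketch using only the per-pod requests listed above; node allocatable capacity is not given in this issue, so it is deliberately left out.

```python
CPU_REQUEST_M = 700      # per-pod CPU request in millicores (from the pod spec above)
MEM_REQUEST_MI = 4608    # per-pod memory request in Mi (from the pod spec above)


def pool_requests(pods):
    """Return (CPU cores, memory GiB) requested by the given number of pusher pods."""
    cpu_cores = pods * CPU_REQUEST_M / 1000
    mem_gib = pods * MEM_REQUEST_MI / 1024
    return cpu_cores, mem_gib


for pods in (24, 26):
    cpu_cores, mem_gib = pool_requests(pods)
    print(f"{pods} pods: {cpu_cores:.1f} cores, {mem_gib:.1f} GiB requested")
# 24 pods: 16.8 cores, 108.0 GiB requested
# 26 pods: 18.2 cores, 117.0 GiB requested
```

Going from 24 to 26 replicas needs an extra 1.4 cores and 9 GiB of schedulable requests somewhere in the pool.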

**Correlated failures:**
- "Upgrade Backend Listen Helm Chart" GitHub Action failed at "Verify rollout" step (May 12 20:32 UTC) — likely because rolling update couldn't schedule temporary extra pods
- No node maintenance events — instability is internal, not GCP-initiated

## Root Cause

The pusher node pool (`pusher-pool-v3`) has run out of schedulable capacity. Pods request 700m CPU + 4.5Gi (4608Mi) memory each with requests=limits (no burst). The 12-node pool can host ~24 pods, but scaling beyond that is blocked.

When pods are overloaded, health checks time out and Kubernetes kills the pods. It then tries to reschedule them but cannot fit the replacements either, which creates a churn cycle.

## Suggested Fix

Option A: **Add 2-4 nodes** to `pusher-pool-v3` (quickest)
Option B: **Increase node machine type** (more headroom per node)
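Option C: **Reduce resource requests**: actual node CPU usage is 4-21%, far below what the 700m requests reserve. If CPU requests were lowered to 400m with a 700m limit, the same nodes could fit ~40% more pods by CPU (the raw ratio 700m / 400m = 1.75 is an upper bound), provided the memory requests, which the FailedScheduling message also reports as insufficient, leave room.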
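
As a rough check on Option C, the sketch below counts how many pods fit on one node when both CPU and memory requests are considered. The per-node budgets are HYPOTHETICAL: they assume each node has room for exactly two pods at today's requests (24 pods on 12 nodes), and the 3072Mi memory request is an illustrative value, not a recommendation from this issue.

```python
def pods_per_node(cpu_budget_m, mem_budget_mi, cpu_request_m, mem_request_mi):
    """Pods that fit on one node: the tighter of the CPU and memory limits wins."""
    return min(cpu_budget_m // cpu_request_m, mem_budget_mi // mem_request_mi)


# HYPOTHETICAL schedulable budget per node: room for two pods at current requests.
CPU_BUDGET_M = 2 * 700
MEM_BUDGET_MI = 2 * 4608

current = pods_per_node(CPU_BUDGET_M, MEM_BUDGET_MI, 700, 4608)
cpu_only = pods_per_node(CPU_BUDGET_M, MEM_BUDGET_MI, 400, 4608)
both = pods_per_node(CPU_BUDGET_M, MEM_BUDGET_MI, 400, 3072)  # 3072Mi is illustrative

print(f"current requests:        {current} pods/node")
print(f"CPU request 400m only:   {cpu_only} pods/node")
print(f"CPU 400m + memory 3072Mi: {both} pods/node")
```

Under these assumed budgets, lowering only the CPU request does not add a pod because memory becomes the binding constraint; real node allocatable values should be substituted before drawing conclusions.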

## Impact

- Pod instability degrades WebSocket connections for wearable users
- Helm chart upgrades can fail if rollout needs headroom
- HPA can't respond to load spikes
