GKE pusher pool at capacity: pods can't scale, health checks failing #7280

@beastoin

Summary

Pusher pods in prod-omi-backend are hitting cluster capacity limits. The HPA wants to scale beyond 24 replicas but new pods can't be scheduled — causing a cycle of health check failures, pod kills, and restarts.

Evidence (24h window, May 12-13 2026)

Pod lifecycle events:

  • Unhealthy events: concentrated entirely on prod-omi-pusher pods — liveness/readiness probes to /health:8080 timing out (context deadline exceeded)
  • Killing events: ~50% pusher, ~40% backend-listen, ~6% vad
  • Zero OOMKill events — not a memory issue
  • Event log truncated at query limit (high volume)

FailedScheduling:

  • All scheduling failures target prod-omi-pusher pods
  • Message: 0/29 nodes available: 11 Insufficient cpu, 11 Insufficient memory, 16 didn't match node affinity, 2 untolerated taints
  • 2 pods currently stuck pending: prod-omi-pusher-cb886f56d-njwml and prod-omi-pusher-cb886f56d-x7vc9
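A minimal sketch for tallying the per-reason counts in a FailedScheduling message like the one quoted above. The helper name `parse_failed_scheduling` is hypothetical, and the parser assumes the simple comma-separated form shown here; real scheduler messages can carry extra detail it does not handle.

```python
import re

def parse_failed_scheduling(message: str) -> dict:
    """Split a FailedScheduling message into node totals and per-reason counts."""
    head, _, tail = message.partition(":")
    m = re.match(r"\s*(\d+)/(\d+) nodes", head)
    if not m:
        raise ValueError("unexpected FailedScheduling message format")
    available, total = int(m.group(1)), int(m.group(2))

    reasons = {}
    for part in tail.split(","):
        # Each part looks like "<count> <reason text>".
        pm = re.match(r"\s*(\d+)\s+(.+?)\s*$", part)
        if pm:
            reasons[pm.group(2)] = int(pm.group(1))
    return {"available": available, "total": total, "reasons": reasons}

# Message copied from the issue text.
msg = ("0/29 nodes available: 11 Insufficient cpu, 11 Insufficient memory, "
       "16 didn't match node affinity, 2 untolerated taints")
parsed = parse_failed_scheduling(msg)

for reason, count in parsed["reasons"].items():
    print(f"{count:>3}  {reason}")
print("sum of reason counts:", sum(parsed["reasons"].values()),
      "of", parsed["total"], "nodes")
```

The reason counts add up to 40 across 29 nodes, so at least some nodes are counted under more than one reason (for example, both Insufficient cpu and Insufficient memory).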

Current state:

  • 24/24 pusher pods Running (but at capacity)
  • HPA metric at ~97% of scale target (29041m/30), wants to scale to 26+
  • 12 nodes in pusher-pool-v3, all Ready
  • Node CPU usage: 4-21% actual, but resource requests fill the scheduler budget

Resource requests per pusher pod:

requests:
  cpu: 700m
  memory: 4608Mi
limits:
  cpu: 700m
  memory: 4608Mi

24 pods × 700m = 16.8 cores and 24 × 4608Mi = 108Gi requested. With about 2 pods per node, the 12 nodes have no room for 2 more pods at 700m / 4608Mi each.
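The request totals can be reproduced with a few lines of arithmetic. This is a sketch using only the figures in this issue; the even spread of pods across nodes is an assumption, not something the event log confirms.

```python
# Figures taken from the issue text.
PODS = 24
NODES = 12
CPU_REQUEST_M = 700      # millicores per pod
MEM_REQUEST_MI = 4608    # MiB per pod

total_cpu_cores = PODS * CPU_REQUEST_M / 1000
total_mem_gi = PODS * MEM_REQUEST_MI / 1024

# ASSUMPTION: pods are spread evenly across the pool.
pods_per_node = PODS // NODES
per_node_cpu_m = pods_per_node * CPU_REQUEST_M
per_node_mem_mi = pods_per_node * MEM_REQUEST_MI

print(f"total requested: {total_cpu_cores} cores, {total_mem_gi} Gi")
print(f"per node ({pods_per_node} pods): {per_node_cpu_m}m CPU, {per_node_mem_mi}Mi memory")
```

Comparing the per-node figures with the output of `kubectl describe node` (the "Allocatable" and "Allocated resources" sections) would show how much of each node's budget is already reserved.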

Correlated failures:

  • "Upgrade Backend Listen Helm Chart" GitHub Action failed at "Verify rollout" step (May 12 20:32 UTC) — likely because rolling update couldn't schedule temporary extra pods
  • No node maintenance events — instability is internal, not GCP-initiated

Root Cause

The pusher node pool (pusher-pool-v3) has run out of schedulable capacity. Each pod requests 700m CPU and 4608Mi (4.5Gi) memory, with requests equal to limits (no burst). The 12-node pool can host ~24 pods, and scaling beyond that is blocked.

When pods are overloaded, health checks time out and Kubernetes kills the pods and tries to reschedule them, but the replacements can't be scheduled either, creating a churn cycle.

Suggested Fix

Option A: Add 2-4 nodes to pusher-pool-v3 (quickest)
Option B: Increase node machine type (more headroom per node)
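Option C: Reduce resource requests. Actual CPU usage is 4-21% vs 700m requested. If requests were lowered to 400m with a 700m limit, each pod would reserve ~43% less CPU, so on CPU alone the same nodes could fit up to 75% more pods (700m / 400m = 1.75). The scheduler also reports Insufficient memory, so the 4608Mi memory request would likely need to come down as well for the extra pods to fit.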
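A rough way to sanity-check Option C is to compute how many pods fit on one node under each request profile. In this sketch the node allocatable values (2000m, 12288Mi) and the reduced memory request (3072Mi) are hypothetical placeholders, not measurements from pusher-pool-v3, and DaemonSet or system pod overhead is ignored.

```python
def pods_that_fit(alloc_cpu_m: int, alloc_mem_mi: int,
                  req_cpu_m: int, req_mem_mi: int) -> int:
    """Pods per node is bounded by whichever resource runs out first."""
    return min(alloc_cpu_m // req_cpu_m, alloc_mem_mi // req_mem_mi)

# HYPOTHETICAL node allocatable; replace with values from `kubectl describe node`.
ALLOC_CPU_M = 2000
ALLOC_MEM_MI = 12288

# Current requests from the issue: 700m CPU, 4608Mi memory.
current = pods_that_fit(ALLOC_CPU_M, ALLOC_MEM_MI, 700, 4608)

# Option C as written: CPU request lowered to 400m, memory unchanged.
cpu_only = pods_that_fit(ALLOC_CPU_M, ALLOC_MEM_MI, 400, 4608)

# HYPOTHETICAL variant: memory request also lowered, here to 3072Mi.
cpu_and_mem = pods_that_fit(ALLOC_CPU_M, ALLOC_MEM_MI, 400, 3072)

print("current requests:      ", current, "pods per node")
print("400m CPU, same memory: ", cpu_only, "pods per node")
print("400m CPU, 3072Mi memory:", cpu_and_mem, "pods per node")
```

With these placeholder numbers, lowering only the CPU request leaves density unchanged because memory becomes the binding constraint. Whether that holds for the real pool depends on its actual allocatable memory.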

Impact

  • Pod instability degrades WebSocket connections for wearable users
  • Helm chart upgrades can fail if rollout needs headroom
  • HPA can't respond to load spikes

Metadata

Labels: bug (Something isn't working), p2 (Priority: Important, score 14-21)
