Adjust HPA and probes to survive conns storm by thainguyensunya · Pull Request #6053 · BasedHardware/omi

thainguyensunya · 2026-03-26T07:47:24Z

Changes:

- Replace tcpSocket with httpGet /v1/health for probes
- Scale out instantly (maxReplicas: 60) when connections increase significantly

greptile-apps · 2026-03-26T07:52:19Z

Greptile Summary

This PR tunes the backend-listen Kubernetes deployment to survive WebSocket connection storms by making two categories of changes: (1) upgrading all health probes from tcpSocket to httpGet /v1/health with more permissive timeouts/thresholds, and (2) aggressively reconfiguring HPA scale-up to double replica count per evaluation cycle with a 30-second stabilization window, raising maxReplicas from 40 to 60. A companion change to the Prometheus adapter removes the response_code="101" WebSocket-only filter from the backend_listen_requests_per_pod metric, enabling the previously-commented-out requestsPerPod: 10 HPA target to operate on all HTTP traffic.

Key changes:

- All probes switched from tcpSocket → httpGet /v1/health (endpoint confirmed to exist at backend/routers/other.py), with timeoutSeconds increased from 1s → 5s
- maxReplicas: 40 → 60
- Scale-up stabilizationWindowSeconds: 120s → 30s; scale-up policy: Percent 30/60s → Percent 100/1s (effectively doubles pods per HPA evaluation cycle, ~15s)
- requestsPerPod: 10 is now active (was commented out), backed by a Prometheus query that now counts all request types rather than only WebSocket upgrade (101) responses
- The Pods: 5 scale-up policy is now effectively unreachable given minReplicas: 20 and selectPolicy: Max
- Including error responses (4xx/5xx) in the backend_listen_requests_per_pod metric means HPA can scale up during error storms, which may or may not be the desired behavior depending on the failure mode
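For illustration, a minimal sketch of what an httpGet probe block of this kind could look like. The path comes from this PR, and the port, period, and timeout match the values shown in the review flowchart (8080, 10s, 5s). The failureThreshold value and the exact key layout in the chart's values file are assumptions, not the merged configuration.

```yaml
# Hypothetical probe layout: only path, port, period and timeout are taken from this PR.
livenessProbe:
  httpGet:
    path: /v1/health
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3   # assumed (Kubernetes default); the PR's actual value is not shown here
readinessProbe:
  httpGet:
    path: /v1/health
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3   # assumed, see above
```

Unlike tcpSocket, which only checks that the port accepts a connection, an httpGet probe succeeds only when the application returns a success status for the request.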

Confidence Score: 4/5

Safe to merge — changes are well-targeted at connection storm resilience with one deliberate trade-off (error-inclusive metric) worth acknowledging.

The probe migration to httpGet is backed by a verified /v1/health endpoint. The aggressive HPA scale-up is intentional and correctly configured. One logic concern exists in the Prometheus adapter: removing the response_code filter means the requestsPerPod metric now counts failures, which could trigger scale-up during non-load failure scenarios — this is a conscious trade-off but has production implications. The redundant Pods policy is a minor cleanup opportunity.

backend/charts/monitoring/prometheus-adapter/prod_omi_prometheus_adapter.yaml — the broadened metric query warrants explicit validation of the requestsPerPod threshold under mixed success/error traffic.

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/charts/backend-listen/prod_omi_backend_listen_values.yaml | Replaces `tcpSocket` probes with `httpGet /v1/health`, increases probe timeouts/thresholds for resilience under load, raises maxReplicas to 60, and aggressively tunes scale-up behavior (stabilizationWindowSeconds: 30, periodSeconds: 1, Percent: 100) to handle connection storms. The `Pods: 5` scale-up policy is now redundant given `selectPolicy: Max` with minReplicas: 20. |
| backend/charts/monitoring/prometheus-adapter/prod_omi_prometheus_adapter.yaml | Removes the `response_code="101"` filter from the `backend_listen_requests_per_pod` query, changing it from WebSocket-upgrade-only traffic to all HTTP traffic. This activates the `requestsPerPod: 10` HPA target and could cause scale-up during error storms since 4xx/5xx responses are now counted. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[HPA Evaluation Every ~15s] --> B{Check Metrics}
    B --> C[activeConnectionsPerPod\ntarget: 22]
    B --> D[requestsPerPod\ntarget: 10\nnew - all response codes]
    C --> E{Max desired\nreplicas wins}
    D --> E
    E -->|Scale up needed| F{stabilizationWindow\n30s elapsed?}
    F -->|Yes| G[Apply scale-up policy\n+100% Percent OR +5 Pods\nselectPolicy: Max]
    G --> H[New replica count\ncapped at maxReplicas: 60]
    F -->|No| I[Hold current replicas]

    subgraph Probes [Health Probes - all pods]
        J[kubelet] -->|httpGet /v1/health :8080\nevery 10s timeout 5s| K[Pod]
        K -->|200 OK| J
    end

    subgraph ScaleDown [Scale Down - conservative]
        L[stabilizationWindow: 600s\n-20% per 300s OR -1 pod per 120s\nselectPolicy: Min]
    end
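
The scaling parameters in the flowchart can be collected into a single HPA fragment. This is a sketch in `autoscaling/v2` form assembled from the values quoted in this review. The chart's values file may use different key names, so the layout is an assumption.

```yaml
# Sketch only: values come from this review, the key layout is assumed (autoscaling/v2 spec).
spec:
  minReplicas: 20
  maxReplicas: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      selectPolicy: Max          # pick the policy that allows the largest increase
      policies:
        - type: Percent
          value: 100             # allow up to +100% of current replicas
          periodSeconds: 1
        - type: Pods
          value: 5
          periodSeconds: 1
    scaleDown:
      stabilizationWindowSeconds: 600
      selectPolicy: Min          # pick the policy that allows the smallest decrease
      policies:
        - type: Percent
          value: 20
          periodSeconds: 300
        - type: Pods
          value: 1
          periodSeconds: 120
```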

greptile-apps · 2026-03-26T07:52:23Z

      - type: Percent
-        value: 30
-        periodSeconds: 60
+        value: 100
+        periodSeconds: 1
      - type: Pods
        value: 5
-        periodSeconds: 60
+        periodSeconds: 1


Pods policy dominated by Percent policy under selectPolicy: Max

With selectPolicy: Max, the HPA always picks the policy that allows the most replicas. Since type: Percent, value: 100 (doubling current pods) will always exceed type: Pods, value: 5 whenever there are more than 5 running pods (which is always the case with minReplicas: 20), the Pods policy is effectively dead code here and will never be selected. If the intent was to ensure at least 5 pods can always be added (as a floor when percentage rounds down on small replica counts), this would only matter below 5 replicas. Consider removing the Pods policy to reduce configuration noise, or document why it's retained.
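
As a worked check of the claim above, here is the upper bound on the per-period increase under the two policies, writing $N$ for the current replica count. This ignores the metric-driven desired replica count, which can only make the actual step smaller.

```latex
\[
\Delta_{\max}(N) \;=\; \max\!\Bigl(\bigl\lceil N \cdot \tfrac{100}{100} \bigr\rceil,\; 5\Bigr) \;=\; \max(N,\, 5)
\]
\[
N = 20:\quad \Delta_{\max} = 20 \;\Rightarrow\; \text{limit } 40,
\qquad
N = 40:\quad \Delta_{\max} = 40 \;\Rightarrow\; \text{limit } \min(80,\ \texttt{maxReplicas}) = 60
\]
```

The Pods term exceeds the Percent term only when $N < 5$, which cannot occur with minReplicas: 20.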

Suggested change

      policies:
      - type: Percent
        value: 100
        periodSeconds: 1
      selectPolicy: Max

greptile-apps · 2026-03-26T07:52:24Z

        as: "backend_listen_requests_per_pod"
      seriesQuery: 'stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_request_count{backend_target_name="custom-domains-api-omi-me-backend-listen-49a4-be"}'
-      metricsQuery: 'avg_over_time((sum(stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_request_count{backend_target_name="custom-domains-api-omi-me-backend-listen-49a4-be",response_code="101"})/sum(kube_deployment_status_replicas{deployment="prod-omi-backend-listen"}))[5m:])'
+      metricsQuery: 'avg_over_time((sum(stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_request_count{backend_target_name="custom-domains-api-omi-me-backend-listen-49a4-be"})/sum(kube_deployment_status_replicas{deployment="prod-omi-backend-listen"}))[5m:])'


Metric now includes error responses — could amplify scaling during failure storms

Removing response_code="101" means backend_listen_requests_per_pod now counts all response codes, including 4xx and 5xx errors. During an error storm (e.g. upstream dependency outage), request failures spike sharply, which would inflate this metric and trigger HPA scale-up. Scaling up when failures are caused by an external dependency or application bug can make the situation worse — adding pods adds no capacity if the bottleneck is downstream.

If the intent is to track total real load (HTTP + WebSocket), consider explicitly filtering to successful responses (2xx + 101) to avoid cascading scale-up during failure events:

metricsQuery: 'avg_over_time((sum(stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_request_count{backend_target_name="custom-domains-api-omi-me-backend-listen-49a4-be",response_code=~"1..|2.."})/sum(kube_deployment_status_replicas{deployment="prod-omi-backend-listen"}))[5m:])'

This is a deliberate trade-off and may be acceptable if the team prefers erring on the side of scaling out; just worth acknowledging explicitly.
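
For readability, the success-only query suggested above can also be written as a folded YAML scalar. This is a sketch of the same expression, only re-wrapped across lines. Other fields of the adapter rule (such as resource mappings) are omitted because they are not shown in this diff, and the surrounding list layout is an assumption.

```yaml
# Same query as the suggestion above, re-wrapped; surrounding rule layout is assumed.
- seriesQuery: 'stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_request_count{backend_target_name="custom-domains-api-omi-me-backend-listen-49a4-be"}'
  name:
    as: "backend_listen_requests_per_pod"
  # ">-" folds the lines below into one string; PromQL ignores the extra whitespace.
  metricsQuery: >-
    avg_over_time((
    sum(stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_request_count{backend_target_name="custom-domains-api-omi-me-backend-listen-49a4-be",response_code=~"1..|2.."})
    /
    sum(kube_deployment_status_replicas{deployment="prod-omi-backend-listen"})
    )[5m:])
```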

beastoin · 2026-03-26T08:11:11Z

lgtm @thainguyensunya

Adjust HPA and probes to survive conns storm

f649b67

thainguyensunya requested a review from beastoin March 26, 2026 07:47

greptile-apps Bot reviewed Mar 26, 2026

Adjust HPA and probes to survive conns storm

87c2f72

beastoin merged commit 8b3f536 into main Mar 26, 2026
2 checks passed

beastoin deleted the task/adjust-hpa-and-probes-for-surviving-conns-storm branch March 26, 2026 08:11

beastoin merged 2 commits into main from task/adjust-hpa-and-probes-for-surviving-conns-storm
