Adjust HPA and probes to survive conns storm #6053

Merged: beastoin merged 2 commits into main from task/adjust-hpa-and-probes-for-surviving-conns-storm on Mar 26, 2026

@thainguyensunya
Collaborator

Changes:

  • Replace tcpSocket with httpGet /v1/health for probes
  • Instantly scale out (maxReplicas: 60) when the connections increase significantly
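To illustrate the first bullet, here is a minimal Python sketch of why an HTTP check is stricter than a TCP check. It is not the kubelet's implementation: `tcp_probe` and `http_probe` are hypothetical helper names, and the timeouts are illustrative defaults. The only behaviour taken from Kubernetes itself is that an httpGet probe counts a status of 200 or above and below 400 as success.

```python
import http.client
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Mimic a tcpSocket probe: pass as soon as the TCP handshake completes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable host, or connect timeout.
        return False


def http_probe(host: str, port: int, path: str = "/v1/health",
               timeout: float = 5.0) -> bool:
    """Mimic an httpGet probe: pass only if a response with 200 <= status < 400 arrives in time."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        status = conn.getresponse().status
        return 200 <= status < 400
    except (OSError, http.client.HTTPException):
        # Covers refused connections, read timeouts, and malformed replies.
        return False
    finally:
        conn.close()
```

A process whose listening socket is open but whose request loop is saturated can still complete the TCP handshake, so `tcp_probe` would pass while `http_probe` times out. That difference is presumably what the probe change is after during a connection storm.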

@greptile-apps
Contributor

greptile-apps Bot commented Mar 26, 2026

Greptile Summary

This PR tunes the backend-listen Kubernetes deployment to survive WebSocket connection storms by making two categories of changes: (1) upgrading all health probes from tcpSocket to httpGet /v1/health with more permissive timeouts/thresholds, and (2) aggressively reconfiguring HPA scale-up to double replica count per evaluation cycle with a 30-second stabilization window, raising maxReplicas from 40 to 60. A companion change to the Prometheus adapter removes the response_code="101" WebSocket-only filter from the backend_listen_requests_per_pod metric, enabling the previously-commented-out requestsPerPod: 10 HPA target to operate on all HTTP traffic.

Key changes:

  • All probes switched from tcpSocket → httpGet /v1/health (endpoint confirmed to exist at backend/routers/other.py), with timeoutSeconds increased from 1s → 5s
  • maxReplicas: 40 → 60
  • Scale-up stabilizationWindowSeconds: 120s → 30s; scale-up policy: Percent 30/60s → Percent 100/1s (effectively doubles pods per HPA evaluation cycle, ~15s)
  • requestsPerPod: 10 is now active (was commented out), backed by a Prometheus query that now counts all request types rather than only WebSocket upgrade (101) responses
  • The Pods: 5 scale-up policy is now effectively unreachable given minReplicas: 20 and selectPolicy: Max
  • Including error responses (4xx/5xx) in the backend_listen_requests_per_pod metric means HPA can scale up during error storms, which may or may not be the desired behavior depending on the failure mode

Confidence Score: 4/5

Safe to merge — changes are well-targeted at connection storm resilience with one deliberate trade-off (error-inclusive metric) worth acknowledging.

The probe migration to httpGet is backed by a verified /v1/health endpoint. The aggressive HPA scale-up is intentional and correctly configured. One logic concern exists in the Prometheus adapter: removing the response_code filter means the requestsPerPod metric now counts failures, which could trigger scale-up during non-load failure scenarios — this is a conscious trade-off but has production implications. The redundant Pods policy is a minor cleanup opportunity.

backend/charts/monitoring/prometheus-adapter/prod_omi_prometheus_adapter.yaml — the broadened metric query warrants explicit validation of the requestsPerPod threshold under mixed success/error traffic.

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/charts/backend-listen/prod_omi_backend_listen_values.yaml | Replaces tcpSocket probes with httpGet /v1/health, increases probe timeouts/thresholds for resilience under load, raises maxReplicas to 60, and aggressively tunes scale-up behavior (stabilizationWindowSeconds: 30, periodSeconds: 1, Percent: 100) to handle connection storms. The Pods: 5 scale-up policy is now redundant given selectPolicy: Max with minReplicas: 20. |
| backend/charts/monitoring/prometheus-adapter/prod_omi_prometheus_adapter.yaml | Removes the response_code="101" filter from the backend_listen_requests_per_pod query, changing it from WebSocket-upgrade-only traffic to all HTTP traffic. This activates the requestsPerPod: 10 HPA target and could cause scale-up during error storms since 4xx/5xx responses are now counted. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[HPA Evaluation Every ~15s] --> B{Check Metrics}
    B --> C[activeConnectionsPerPod\ntarget: 22]
    B --> D[requestsPerPod\ntarget: 10\nnew - all response codes]
    C --> E{Max desired\nreplicas wins}
    D --> E
    E -->|Scale up needed| F{stabilizationWindow\n30s elapsed?}
    F -->|Yes| G[Apply scale-up policy\n+100% Percent OR +5 Pods\nselectPolicy: Max]
    G --> H[New replica count\ncapped at maxReplicas: 60]
    F -->|No| I[Hold current replicas]

    subgraph Probes [Health Probes - all pods]
        J[kubelet] -->|httpGet /v1/health :8080\nevery 10s timeout 5s| K[Pod]
        K -->|200 OK| J
    end

    subgraph ScaleDown [Scale Down - conservative]
        L[stabilizationWindow: 600s\n-20% per 300s OR -1 pod per 120s\nselectPolicy: Min]
    end

Reviews (1): Last reviewed commit: "Adjust HPA and probes to survive conns s..."
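The evaluation loop in the flowchart above can be sketched in Python as a rough model. It assumes the standard HPA rule that, for a per-pod average metric, the recommendation is ceil(total load / per-pod target) and that the largest recommendation across metrics wins. The targets (22 and 10), minReplicas: 20, and maxReplicas: 60 come from this PR; the load totals, the function names, and the decision to ignore the 30s stabilization window and scale-down are simplifying assumptions.

```python
from typing import List, Mapping


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding float rounding surprises."""
    return -(-a // b)


def desired_replicas(total_load: Mapping[str, int],
                     targets: Mapping[str, int]) -> int:
    """Largest per-metric recommendation: ceil(total load / per-pod target)."""
    return max(ceil_div(total_load[name], targets[name]) for name in targets)


def scale_up_limit(current: int, percent: int = 100, pods: int = 5) -> int:
    """Most replicas allowed this cycle under selectPolicy: Max."""
    by_percent = ceil_div(current * (100 + percent), 100)
    by_pods = current + pods
    return max(by_percent, by_pods)


def simulate(total_load: Mapping[str, int], targets: Mapping[str, int],
             start: int = 20, max_replicas: int = 60,
             cycles: int = 3) -> List[int]:
    """Replica count after each ~15s evaluation (scale-up only)."""
    history = [start]
    current = start
    for _ in range(cycles):
        want = desired_replicas(total_load, targets)
        nxt = min(want, scale_up_limit(current), max_replicas)
        current = max(current, nxt)  # this sketch never scales down
        history.append(current)
    return history


# Hypothetical storm: 1500 open connections and 300 requests in total.
targets = {"activeConnectionsPerPod": 22, "requestsPerPod": 10}
storm = {"activeConnectionsPerPod": 1500, "requestsPerPod": 300}
print(simulate(storm, targets))  # [20, 40, 60, 60]
```

Under these assumptions the deployment reaches the new ceiling in two evaluation cycles, because each cycle may double the replica count until maxReplicas caps it.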

Comment on lines +405 to +410
  - type: Percent
-   value: 30
-   periodSeconds: 60
+   value: 100
+   periodSeconds: 1
  - type: Pods
    value: 5
-   periodSeconds: 60
+   periodSeconds: 1

P2 Pods policy dominated by Percent policy under selectPolicy: Max

With selectPolicy: Max, the HPA always picks the policy that allows the most replicas. Since type: Percent, value: 100 (doubling current pods) will always exceed type: Pods, value: 5 whenever there are more than 5 running pods (which is always the case with minReplicas: 20), the Pods policy is effectively dead code here and will never be selected. If the intent was to ensure at least 5 pods can always be added (as a floor when percentage rounds down on small replica counts), this would only matter below 5 replicas. Consider removing the Pods policy to reduce configuration noise, or document why it's retained.

Suggested change
policies:
  - type: Percent
    value: 100
    periodSeconds: 1
selectPolicy: Max
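The dominance argument in this comment can be checked with a few lines of Python. This is a sketch under the simplifying assumption that each policy's limit is computed from the current replica count: Percent allows ceil(current × (100 + value) / 100) and Pods allows current + value. The function name `winning_policy` is hypothetical.

```python
def winning_policy(current: int, percent: int = 100, pods: int = 5) -> str:
    """Return which scale-up policy selectPolicy: Max would pick."""
    # Ceiling division in integers: ceil(current * (100 + percent) / 100).
    by_percent = -(-current * (100 + percent) // 100)
    by_pods = current + pods
    if by_percent > by_pods:
        return "Percent"
    if by_pods > by_percent:
        return "Pods"
    return "tie"


print(winning_policy(4))   # Pods    (8 vs 9)
print(winning_policy(5))   # tie     (10 vs 10)
print(winning_policy(20))  # Percent (40 vs 25)
```

With minReplicas: 20 and maxReplicas: 60, every reachable replica count lands on the Percent branch, which matches the comment's claim that the Pods policy is never selected.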

  as: "backend_listen_requests_per_pod"
  seriesQuery: 'stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_request_count{backend_target_name="custom-domains-api-omi-me-backend-listen-49a4-be"}'
- metricsQuery: 'avg_over_time((sum(stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_request_count{backend_target_name="custom-domains-api-omi-me-backend-listen-49a4-be",response_code="101"})/sum(kube_deployment_status_replicas{deployment="prod-omi-backend-listen"}))[5m:])'
+ metricsQuery: 'avg_over_time((sum(stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_request_count{backend_target_name="custom-domains-api-omi-me-backend-listen-49a4-be"})/sum(kube_deployment_status_replicas{deployment="prod-omi-backend-listen"}))[5m:])'

P2 Metric now includes error responses — could amplify scaling during failure storms

Removing response_code="101" means backend_listen_requests_per_pod now counts all response codes, including 4xx and 5xx errors. During an error storm (e.g. upstream dependency outage), request failures spike sharply, which would inflate this metric and trigger HPA scale-up. Scaling up when failures are caused by an external dependency or application bug can make the situation worse — adding pods adds no capacity if the bottleneck is downstream.

If the intent is to track total real load (HTTP + WebSocket), consider explicitly filtering to successful responses (2xx + 101) to avoid cascading scale-up during failure events:

metricsQuery: 'avg_over_time((sum(stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_request_count{backend_target_name="custom-domains-api-omi-me-backend-listen-49a4-be",response_code=~"1..|2.."})/sum(kube_deployment_status_replicas{deployment="prod-omi-backend-listen"}))[5m:])'

This is a deliberate trade-off and may be acceptable if the team prefers erring on the side of scaling out; just worth acknowledging explicitly.
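To make the trade-off concrete, the following Python sketch compares the per-pod value under the three label filters discussed here. The request counts are invented for illustration, `requests_per_pod` is a hypothetical helper, and the sketch ignores the 5-minute avg_over_time smoothing in the real query. It relies on the fact that PromQL regex matchers are fully anchored, which `re.fullmatch` reproduces.

```python
import re
from typing import Mapping, Optional


def requests_per_pod(counts_by_code: Mapping[str, int], replicas: int,
                     code_pattern: Optional[str] = None) -> float:
    """Sum counts whose response_code matches the pattern, divided by replicas.

    A pattern of None means no response_code filter at all.
    """
    total = sum(
        count
        for code, count in counts_by_code.items()
        if code_pattern is None or re.fullmatch(code_pattern, code)
    )
    return total / replicas


# Hypothetical snapshot during a downstream outage: mostly 502s.
counts = {"101": 150, "200": 50, "502": 400}

print(requests_per_pod(counts, 20, "101"))      # 7.5  (old query)
print(requests_per_pod(counts, 20))             # 30.0 (merged query)
print(requests_per_pod(counts, 20, "1..|2.."))  # 10.0 (suggested filter)
```

Against the requestsPerPod: 10 target, the unfiltered value in this made-up snapshot would ask for three times the current replica count, while the success-only filter sits exactly at target.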

@beastoin beastoin merged commit 8b3f536 into main Mar 26, 2026
2 checks passed
@beastoin beastoin deleted the task/adjust-hpa-and-probes-for-surviving-conns-storm branch March 26, 2026 08:11
@beastoin
Collaborator

lgtm @thainguyensunya

Glucksberg pushed a commit to Glucksberg/omi-local that referenced this pull request Apr 28, 2026
Changes:
- Replace **tcpSocket** by **httpGet** **/v1/health** for probes
- Instantly scale out (maxReplicas: 60) when the connections increase
significantly