Adjust HPA and probes to survive conns storm #6053
Greptile Summary
This PR tunes the
Confidence Score: 4/5
Safe to merge. The changes are well-targeted at connection storm resilience, with one deliberate trade-off (an error-inclusive metric) worth acknowledging.
The probe migration to httpGet is backed by a verified /v1/health endpoint. The aggressive HPA scale-up is intentional and correctly configured. One logic concern exists in the Prometheus adapter: removing the response_code filter means the requestsPerPod metric now counts failures, which could trigger scale-up during non-load failure scenarios. This is a conscious trade-off, but it has production implications. The redundant Pods policy is a minor cleanup opportunity.
backend/charts/monitoring/prometheus-adapter/prod_omi_prometheus_adapter.yaml: the broadened metric query warrants explicit validation of the requestsPerPod threshold under mixed success/error traffic.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[HPA Evaluation Every ~15s] --> B{Check Metrics}
B --> C[activeConnectionsPerPod\ntarget: 22]
B --> D[requestsPerPod\ntarget: 10\nnew - all response codes]
C --> E{Max desired\nreplicas wins}
D --> E
E -->|Scale up needed| F{stabilizationWindow\n30s elapsed?}
F -->|Yes| G[Apply scale-up policy\n+100% Percent OR +5 Pods\nselectPolicy: Max]
G --> H[New replica count\ncapped at maxReplicas: 60]
F -->|No| I[Hold current replicas]
subgraph Probes [Health Probes - all pods]
J[kubelet] -->|httpGet /v1/health :8080\nevery 10s timeout 5s| K[Pod]
K -->|200 OK| J
end
subgraph ScaleDown [Scale Down - conservative]
L[stabilizationWindow: 600s\n-20% per 300s OR -1 pod per 120s\nselectPolicy: Min]
end
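The evaluation path in the flowchart can be sketched in a few lines. This is a simplified model, not the controller's actual code: it uses the standard HPA ratio `ceil(currentReplicas * currentValue / targetValue)`, ignores the controller's tolerance band and the stabilization window, and takes the targets (22 and 10), `minReplicas: 20` and `maxReplicas: 60` from this PR. The function names and sample metric values are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_value, target_value):
    """Standard HPA ratio for one metric (tolerance band ignored)."""
    return math.ceil(current_replicas * current_value / target_value)


def evaluate(current_replicas, metrics, min_replicas=20, max_replicas=60):
    """metrics is a list of (current per-pod value, target) pairs.

    The metric that asks for the most replicas wins, and the result is
    clamped to the [min_replicas, max_replicas] range.
    """
    proposals = [
        desired_replicas(current_replicas, current, target)
        for current, target in metrics
    ]
    return max(min_replicas, min(max(proposals), max_replicas))


# Hypothetical sample: 33 active connections per pod (target 22) and
# 8 requests per pod (target 10) on 20 replicas.
print(evaluate(20, [(33, 22), (8, 10)]))
```

In this sample the connections metric asks for more replicas than the requests metric, so it alone decides the outcome.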
Reviews (1): Last reviewed commit: "Adjust HPA and probes to survive conns s..."
  - type: Percent
-   value: 30
-   periodSeconds: 60
+   value: 100
+   periodSeconds: 1
  - type: Pods
    value: 5
-   periodSeconds: 60
+   periodSeconds: 1
Pods policy dominated by Percent policy under selectPolicy: Max
With selectPolicy: Max, the HPA always picks the policy that allows the most replicas. Since type: Percent, value: 100 (doubling current pods) will always exceed type: Pods, value: 5 whenever there are more than 5 running pods (which is always the case with minReplicas: 20), the Pods policy is effectively dead code here and will never be selected. If the intent was to ensure at least 5 pods can always be added (as a floor when percentage rounds down on small replica counts), this would only matter below 5 replicas. Consider removing the Pods policy to reduce configuration noise, or document why it's retained.
policies:
  - type: Percent
    value: 100
    periodSeconds: 1
selectPolicy: Max
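The dominance argument in this comment can be checked numerically. The sketch below is an illustration only: it compares how many pods each policy would allow in one period and assumes ties go to Percent, which does not affect the result. The function names are hypothetical; the values 100, 5, 20 and 60 come from this PR.

```python
import math


def allowed_increase(current_replicas, percent=100, pods=5):
    """Pods each scale-up policy would allow adding in one period."""
    return {
        "Percent": math.ceil(current_replicas * percent / 100),
        "Pods": pods,
    }


def winning_policy(current_replicas):
    """selectPolicy: Max picks whichever policy allows the larger change."""
    allowed = allowed_increase(current_replicas)
    return "Percent" if allowed["Percent"] >= allowed["Pods"] else "Pods"


# Across the whole configured range (minReplicas 20 to maxReplicas 60),
# the Percent policy always wins.
print({winning_policy(n) for n in range(20, 61)})
```

The Pods policy would only be selected below 5 replicas, which `minReplicas: 20` rules out.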
  as: "backend_listen_requests_per_pod"
  seriesQuery: 'stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_request_count{backend_target_name="custom-domains-api-omi-me-backend-listen-49a4-be"}'
- metricsQuery: 'avg_over_time((sum(stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_request_count{backend_target_name="custom-domains-api-omi-me-backend-listen-49a4-be",response_code="101"})/sum(kube_deployment_status_replicas{deployment="prod-omi-backend-listen"}))[5m:])'
+ metricsQuery: 'avg_over_time((sum(stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_request_count{backend_target_name="custom-domains-api-omi-me-backend-listen-49a4-be"})/sum(kube_deployment_status_replicas{deployment="prod-omi-backend-listen"}))[5m:])'
Metric now includes error responses — could amplify scaling during failure storms
Removing response_code="101" means backend_listen_requests_per_pod now counts all response codes, including 4xx and 5xx errors. During an error storm (e.g. upstream dependency outage), request failures spike sharply, which would inflate this metric and trigger HPA scale-up. Scaling up when failures are caused by an external dependency or application bug can make the situation worse — adding pods adds no capacity if the bottleneck is downstream.
If the intent is to track total real load (HTTP + WebSocket), consider explicitly filtering to successful responses to avoid cascading scale-up during failure events. The pattern below matches any 1xx or 2xx code, which covers 101 and all 2xx responses:
metricsQuery: 'avg_over_time((sum(stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_request_count{backend_target_name="custom-domains-api-omi-me-backend-listen-49a4-be",response_code=~"1..|2.."})/sum(kube_deployment_status_replicas{deployment="prod-omi-backend-listen"}))[5m:])'
This is a deliberate trade-off and may be acceptable if the team prefers erring on the side of scaling out; just worth acknowledging explicitly.
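To make the trade-off concrete, here is a rough sketch of how the filter changes the metric during an error storm. The request counts are invented for illustration. The `requestsPerPod` target of 10 and the starting count of 20 replicas come from this PR, and the replica formula is the standard HPA ratio without the tolerance band or the `maxReplicas` cap.

```python
import math


def per_pod_rate(counts_by_code, replicas, include):
    """Sum the request counts whose response code passes the filter."""
    total = sum(n for code, n in counts_by_code.items() if include(code))
    return total / replicas


def desired_replicas(current_replicas, current_value, target_value=10):
    return math.ceil(current_replicas * current_value / target_value)


def all_codes(code):
    return True  # the query as merged: no response_code filter


def success_only(code):
    return 100 <= code < 300  # equivalent of response_code=~"1..|2.."


# Hypothetical traffic: steady load plus a burst of 502s from a failing dependency.
storm = {101: 180, 200: 20, 502: 400}
replicas = 20

unfiltered = per_pod_rate(storm, replicas, all_codes)
filtered = per_pod_rate(storm, replicas, success_only)
print(desired_replicas(replicas, unfiltered), desired_replicas(replicas, filtered))
```

With these made-up numbers, the unfiltered metric asks for three times as many replicas as the filtered one, even though the successful load is unchanged.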
lgtm @thainguyensunya
Changes:
- Replace **tcpSocket** with **httpGet** **/v1/health** for probes
- Instantly scale out (maxReplicas: 60) when the connections increase significantly
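One way to see why the probe change matters is to model the two probe types as plain predicates. This is only a sketch: the function names are hypothetical, and the PR itself does not spell out this rationale. It relies on two documented behaviors. A tcpSocket probe passes as soon as the port accepts a connection, while an httpGet probe passes only on a response with a status from 200 to 399 received within `timeoutSeconds` (5s in this PR).

```python
def tcp_probe_ok(port_accepts_connection):
    """tcpSocket probe: only checks that the port can be opened."""
    return port_accepts_connection


def http_probe_ok(status_code, elapsed_s, timeout_s=5.0):
    """httpGet probe: needs a 2xx/3xx response before the timeout."""
    return elapsed_s < timeout_s and 200 <= status_code < 400


# A pod that is saturated during a connection storm: the port is still
# open, but /v1/health answers 503, or answers too slowly.
print(tcp_probe_ok(True))        # the TCP check still passes
print(http_probe_ok(503, 0.2))   # the HTTP check fails on the status
print(http_probe_ok(200, 7.5))   # the HTTP check fails on the timeout
```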