
Use backend-listen active WS connections for autoscaling#5595

Merged
beastoin merged 1 commit into main from task/use-backend-listen-active-ws-for-auto-scaling
Mar 13, 2026

Conversation

@thainguyensunya
Collaborator

Changes:

  • Use backend_listen_active_ws_connections_per_pod for autoscaling
  • Modify scaleUp and scaleDown behaviors to be compatible with the new autoscaling metric

@greptile-apps
Contributor

greptile-apps Bot commented Mar 13, 2026

Greptile Summary

This PR migrates the backend-listen HPA autoscaling signal from HTTP request-rate (requestsPerPod) to the number of active WebSocket connections per pod (activeConnectionsPerPod: 20). The new Prometheus Adapter rule (avg(backend_listen_active_ws_connections)) is registered in both dev and prod adapters, and the HPA template now supports the new external metric. Scaling behavior is tuned to be more conservative (slower scale-up and scale-down).

Key changes and concerns:

  • Scale-up stabilization window increased from 30s → 120s: For a real-time audio/WebSocket workload, this 4× increase means the HPA waits 2 minutes before reacting to connection spikes, which could leave pods overloaded during traffic surges.
  • avg() metric is sensitive to scrape gaps: avg(backend_listen_active_ws_connections) will be computed over fewer samples if any pod temporarily stops reporting, transiently inflating the metric and potentially causing unnecessary scale-up events.
  • Dev HPA does not enable the new metric: dev_omi_backend_listen_values.yaml does not include activeConnectionsPerPod, so the full prometheus-adapter → HPA pipeline for this new signal is untested in the dev environment before it goes to production.
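Based on the changes described, the new signal pairing likely looks like the following sketch. The exact structure of the rendered HPA template is an assumption; only the metric name and the target of 20 come from the PR:

```yaml
# Sketch of the new external metric block rendered by templates/hpa.yaml.
# The adapter's avg() query already returns a per-pod average, so the HPA
# compares it directly against a fixed target using type: Value.
metrics:
  - type: External
    external:
      metric:
        name: backend_listen_active_ws_connections_per_pod
      target:
        type: Value
        value: "20"   # activeConnectionsPerPod from the values file
```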

Confidence Score: 3/5

  • Mostly safe to merge, but the 4× increase in scale-up stabilization window and lack of dev-environment validation of the new metric signal introduce production risk for a latency-sensitive WebSocket service.
  • The core idea (scaling on active WS connections) is sound and the implementation is technically correct. The Prometheus Adapter avg() + HPA type: Value pairing is valid. However, the scale-up stabilization window change (30s → 120s) is a significant behavior regression for WebSocket spike scenarios, and the new autoscaling signal is being deployed directly to production without being exercised in the dev environment first.
  • backend/charts/backend-listen/prod_omi_backend_listen_values.yaml — specifically the scale-up stabilizationWindowSeconds and the absence of dev validation.

Important Files Changed

Filename Overview
backend/charts/backend-listen/prod_omi_backend_listen_values.yaml Switches autoscaling from requestsPerPod (HTTP/WS upgrade request rate) to activeConnectionsPerPod (live WS gauge at 20). Scale-up stabilization window increased 4x to 120s, and both scale-up policies slowed significantly — these changes may delay capacity response during connection spikes.
backend/charts/backend-listen/templates/hpa.yaml Adds new HPA metric block for backend_listen_active_ws_connections_per_pod using type: Value — correct pairing with the avg()-based prometheus adapter query. Template logic is straightforward.
backend/charts/monitoring/prometheus-adapter/prod_omi_prometheus_adapter.yaml Registers backend_listen_active_ws_connections_per_pod external metric using avg(backend_listen_active_ws_connections). The avg() approach is functionally correct but susceptible to producing an inflated value if any pod temporarily drops out of Prometheus scraping.
backend/charts/monitoring/prometheus-adapter/dev_omi_prometheus_adapter.yaml Mirrors prod prometheus adapter change — registers the backend_listen_active_ws_connections_per_pod metric rule. Dev backend-listen values do not yet enable this metric in the HPA, so there is no dev-env validation of the full pipeline.

Sequence Diagram

sequenceDiagram
    participant Pod as backend-listen Pod
    participant Prom as Prometheus
    participant Adapter as Prometheus Adapter
    participant HPA as Kubernetes HPA

    Pod->>Prom: Expose /metrics<br/>backend_listen_active_ws_connections (Gauge)
    Note over Pod,Prom: ACTIVE_WS_CONNECTIONS.inc() on connect<br/>ACTIVE_WS_CONNECTIONS.dec() on disconnect

    Prom->>Adapter: Scrape raw metric<br/>backend_listen_active_ws_connections
    Note over Adapter: metricsQuery: avg(backend_listen_active_ws_connections)<br/>Exposed as: backend_listen_active_ws_connections_per_pod

    HPA->>Adapter: GET /apis/external.metrics.k8s.io/v1beta1/<br/>backend_listen_active_ws_connections_per_pod
    Adapter-->>HPA: Return avg value (e.g. 25)

    Note over HPA: type: Value, target: 20<br/>stabilizationWindowSeconds: 120 (scale-up)<br/>desiredReplicas = ceil(current * 25/20)

    HPA->>Pod: Scale deployment<br/>(up to maxReplicas: 50, down to minReplicas: 26)

Last reviewed commit: cb9be57

```diff
     selectPolicy: Min
   scaleUp:
-    stabilizationWindowSeconds: 30
+    stabilizationWindowSeconds: 120
```
Contributor


Scale-up stabilization window increased 4x without justification

stabilizationWindowSeconds for scale-up was increased from 30 to 120. For a WebSocket server, this means the HPA will observe metric values for 2 full minutes before issuing any scale-up decision. If there is a sudden influx of WebSocket connections, all existing pods remain under load during this window.

The previous value of 30 seconds was already quite defensive for active WS connection spikes. At 26 minimum pods and 20 connections/pod as the target, a connection surge would push the metric over the threshold quickly, but scaling would be deferred for 2 minutes, during which each pod may be handling significantly more than the target 20 connections.

Consider reverting to a shorter stabilization window (e.g. 30–60s) for scale-up, or document the reasoning (e.g. if there is evidence of flapping/oscillation that justifies this delay).

Suggested change

```diff
-stabilizationWindowSeconds: 120
+stabilizationWindowSeconds: 60
```
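If flapping is a concern, a shorter window can still be damped through the scale-up policies themselves rather than the stabilization window. A sketch, assuming the standard autoscaling/v2 behavior schema; the Percent/Pods policy values are illustrative, not taken from this PR:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60   # compromise between 30s and 120s
    selectPolicy: Max                # react with the more aggressive policy
    policies:
      - type: Percent
        value: 50                    # illustrative: grow by up to 50% per minute
        periodSeconds: 60
      - type: Pods
        value: 4                     # illustrative: or by up to 4 pods per minute
        periodSeconds: 60
```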

Comment on lines +6 to +12
```yaml
- name:
    as: "backend_listen_active_ws_connections_per_pod"
  seriesQuery: 'backend_listen_active_ws_connections'
  metricsQuery: 'avg(backend_listen_active_ws_connections)'
  resources:
    overrides:
      namespace: { resource: "namespace" }
```
Contributor


avg() metric with HPA type: Value may react to scrape gaps, not real load

The metricsQuery computes avg(backend_listen_active_ws_connections) — the average connections per pod. When any pod temporarily stops reporting (restart, scrape failure, etc.), the average is computed over fewer samples, artificially inflating the metric. This can trigger premature scale-up events.

A more robust alternative is to export the sum of all connections and let Kubernetes compute the per-replica average itself, which would require changing the HPA target from type: Value to type: AverageValue.

As-is, the avg() approach is functionally correct under normal conditions, but it is worth noting that the absence of a pod from Prometheus scraping will cause the metric to transiently over-read.

The same pattern exists in both prod_omi_prometheus_adapter.yaml (line 9) and dev_omi_prometheus_adapter.yaml (line 9).

```diff
-requestsPerPod: 10
+# requestsPerPod: 10
+# failedResponseCode: 10
+activeConnectionsPerPod: 20
```
Contributor


New autoscaling metric not tested in dev environment

The dev prometheus adapter (dev_omi_prometheus_adapter.yaml) now registers the backend_listen_active_ws_connections_per_pod metric, but dev_omi_backend_listen_values.yaml does not include an activeConnectionsPerPod entry in its autoscaling section. This means the dev HPA does not exercise the new metric, so the new autoscaling behavior goes to production without a dev validation path.

Consider adding activeConnectionsPerPod to the dev values file (possibly with a different — perhaps lower — threshold) to test the end-to-end prometheus-adapter → HPA pipeline before relying on it in production.
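For example, the dev values file could enable the metric with a lower threshold. A sketch only: the `activeConnectionsPerPod` key mirrors the prod values file, while the replica counts and the threshold of 5 are illustrative dev sizing, not from this PR:

```yaml
# dev_omi_backend_listen_values.yaml (sketch)
autoscaling:
  enabled: true
  minReplicas: 1               # illustrative dev sizing
  maxReplicas: 3
  activeConnectionsPerPod: 5   # illustrative lower threshold for dev testing
```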

@beastoin
Collaborator

lgtm

@beastoin beastoin merged commit 03c6acf into main Mar 13, 2026
2 checks passed
@beastoin beastoin deleted the task/use-backend-listen-active-ws-for-auto-scaling branch March 13, 2026 10:05
Glucksberg pushed a commit to Glucksberg/omi-local that referenced this pull request Apr 28, 2026