backend: raise storage_executor pool 96 → 128 #7529

Merged: mdmohsin7 merged 1 commit into main from caleb/storage-pool-128 on May 28, 2026
@mdmohsin7
Member

Summary

  • Raise storage_executor max_workers from 96 → 128 in backend/utils/executors.py.
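For readers without the diff open, the change has roughly the shape below. This is a minimal sketch, not the contents of backend/utils/executors.py: only the `storage_executor` name and the `max_workers` values come from this PR, and the `thread_name_prefix` value is an assumption added for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical reconstruction: only the variable name and the max_workers
# values (96 before, 128 after) are taken from this PR.
storage_executor = ThreadPoolExecutor(
    max_workers=128,  # was 96
    thread_name_prefix="storage",  # assumed prefix, for illustration only
)
```

`ThreadPoolExecutor` starts worker threads on demand as tasks are submitted, so `max_workers` is a ceiling rather than a count of threads created at import time.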

Why

After the previous round of changes (minScale=10, concurrency=6, _PRECACHE_FILE_SEM=2) the storage pool on backend-sync is still pegged at 100% utilization. The queue depth distribution over the past 30 minutes (steady-state on the new revision):

| Percentile | Queue depth |
|------------|-------------|
| p50        | 19          |
| p95        | 30          |
| p99        | 38          |
| max        | 43          |

96 + 32 = 128: the 32 added workers absorb the observed p95 queue depth (30). Sync work that was waiting in this queue should drop to near zero in the common case.
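The sizing arithmetic can be checked with a short sketch. It rests on a simplifying assumption that is not measured in this PR: arrival rate and task duration stay unchanged, so each added worker takes exactly one task off the observed queue. The `residual_queue` helper is hypothetical.

```python
# Queue-depth percentiles reported above (30-minute steady state).
queue_depth = {"p50": 19, "p95": 30, "p99": 38, "max": 43}

OLD_WORKERS = 96


def residual_queue(new_workers: int) -> dict[str, int]:
    """Tasks still queued at each percentile after the pool grows.

    Assumes every added worker removes exactly one task from the
    observed queue (unchanged arrival rate and task duration).
    """
    added = new_workers - OLD_WORKERS
    return {name: max(0, depth - added) for name, depth in queue_depth.items()}


print(residual_queue(128))  # {'p50': 0, 'p95': 0, 'p99': 6, 'max': 11}
print(residual_queue(140))  # {'p50': 0, 'p95': 0, 'p99': 0, 'max': 0}
```

Under that assumption 128 clears p50 and p95 but leaves a small tail at p99 and max, while 140 also covers the observed maximum of 43.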

Safety analysis (verified, not assumed)

Memory:

  • Current memory p99 per instance: 13% of 8 GB
  • Linux lazy-commits pthread stacks — virtual address space grows ~64 MB (32 × 2 MB) but actual RSS impact is ~1–2 MB per instance for idle/I/O-bound threads
  • Plenty of headroom
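The memory figures above work out as follows. This sketch only restates the numbers from the bullets. The worst case it computes (every added stack fully resident) is a deliberately pessimistic assumption, not an observed value.

```python
ADDED_THREADS = 32
STACK_RESERVATION_MB = 2     # per-thread figure used in the bullets above
INSTANCE_MEMORY_GB = 8
MEMORY_P99_FRACTION = 0.13   # current p99 memory utilization per instance

# Virtual address space reserved for the new thread stacks.
virtual_growth_mb = ADDED_THREADS * STACK_RESERVATION_MB

# Memory in use at p99 today, and what is left over.
used_gb = INSTANCE_MEMORY_GB * MEMORY_P99_FRACTION
free_gb = INSTANCE_MEMORY_GB - used_gb

# Pessimistic bound: all reserved stack pages become resident at once.
worst_case_fraction_of_free = virtual_growth_mb / (free_gb * 1024)

print(f"virtual growth: {virtual_growth_mb} MB")
print(f"used at p99: {used_gb:.2f} GB, free: {free_gb:.2f} GB")
print(f"worst case share of free memory: {worst_case_fraction_of_free:.2%}")
```

Even if the full 64 MB were resident, it would be under 1% of the memory that is free at p99.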

CPU:

  • Current CPU p99 per instance: 84% (bursty, not sustained)
  • Storage work is mostly I/O-bound (GCS, network) — Python releases the GIL during blocking syscalls, so additional threads scale with I/O latency, not with CPU
  • CPU-bound portion (PCM merge / WAV encoding) is bounded by the 2 vCPU ceiling regardless of thread count
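The I/O-bound argument can be illustrated with a toy benchmark. This is a sketch under stated assumptions: `time.sleep` stands in for a blocking GCS or network call (both release the GIL), and the worker counts, task count, and latency are made up for the demo rather than taken from production.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_blocking_io(latency_s: float) -> None:
    # time.sleep releases the GIL, standing in for a blocking network call.
    time.sleep(latency_s)


def run_batch(workers: int, tasks: int, latency_s: float) -> float:
    """Return wall-clock seconds to finish `tasks` simulated I/O calls."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so every task has completed before timing stops.
        list(pool.map(fake_blocking_io, [latency_s] * tasks))
    return time.perf_counter() - start


narrow = run_batch(workers=4, tasks=32, latency_s=0.05)
wide = run_batch(workers=32, tasks=32, latency_s=0.05)
print(f"4 workers: {narrow:.2f}s, 32 workers: {wide:.2f}s")
```

With blocking work the wider pool finishes in roughly one latency period, while the narrow pool needs several rounds. CPU-bound work such as PCM merge or WAV encoding would not show this speedup, which is the point of the last bullet.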

Cloud Run autoscaler: unaffected. Autoscales on CPU + request concurrency; thread count is a per-instance knob.

Test plan

  • Watch executor_pool_health warnings on backend-sync for 15–30 min post-deploy — storage max_q should drop substantially (target: well under 10 in p95)
  • Watch memory utilization — should not exceed 25% per instance even at peak
  • No new 5xx errors on /v2/sync-local-files or /v1/sync/audio/*
  • If max_q still > 10 consistently after this, follow-up bump to 140 (one-line change)
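The follow-up criterion in the last bullet could be checked mechanically once queue-depth samples are collected. A minimal sketch: the helper names and sample values are hypothetical, and how `max_q` samples are extracted from the executor_pool_health warnings is not shown here.

```python
def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile of a non-empty list of integer samples."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def needs_followup(max_q_samples: list[int], threshold: int = 10) -> bool:
    """True if p95 queue depth is still above the test-plan threshold."""
    return percentile(max_q_samples, 95) > threshold


# Hypothetical post-deploy samples: mostly empty queue with one spike.
healthy = [0] * 19 + [50]
# Hypothetical samples resembling the pre-change steady state.
still_pegged = [19] * 10 + [30] * 9 + [43]

print(needs_followup(healthy))       # False
print(needs_followup(still_pegged))  # True
```

Nearest-rank is used so a single outlier in a small sample does not trip the check. A different percentile definition would shift the result slightly.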

Storage pool was averaging 100% util on backend-sync with queue depth
distribution (30-min steady-state, post minScale=10 / concurrency=6 /
_PRECACHE_FILE_SEM=2): p50=19, p95=30, p99=38, max=43.

96 + 32 = 128 covers the observed p95 queue depth of 30. Memory
headroom is comfortable: per-instance p99 sits at 13% of 8GB, and Linux
lazy-commits thread stacks so the additional 32 threads add ~1–2 MB
actual RSS per instance, not the virtual address space figure.

CPU p99 bursts to 84% but most storage work is I/O-bound (GCS calls,
network) — GIL is released during blocking syscalls so the extra
threads scale with the I/O latency they hide, not with CPU. The
CPU-bound portion (PCM merge / WAV encoding inside audio_merge)
remains bounded by the 2 vCPU ceiling regardless.

If queue depth still trends above ~10 after this, the next move is a
bump to 140 (covers max=43); 192 is overshoot for the observed
distribution.
@greptile-apps
Contributor

greptile-apps Bot commented May 28, 2026

Greptile Summary

This PR raises the storage_executor thread-pool ceiling from 96 to 128 in response to sustained 100% utilisation and observed queue depths (p95 ≈ 30, max ≈ 43) on the backend-sync service. The +32 workers are sized to absorb the p95 queue depth with headroom, and the PR includes a thorough memory/CPU safety analysis and a concrete post-deploy watch list.

  • Only changed line: max_workers=96 → max_workers=128 on storage_executor in backend/utils/executors.py. All other executor sizes, monitoring logic, and shutdown behaviour are untouched.
  • Total per-instance thread count across the pools shown in the flowchart rises from 178 → 210 (8 + 24 + 6 + 4 + 16 + 24 = 82, plus 96 before and 128 after), which remains well within Linux and Cloud Run limits; storage work is I/O-bound (GCS/network), so the GIL is not a bottleneck for these extra threads.

Confidence Score: 5/5

Safe to merge — the diff is a single integer constant change backed by production queue-depth data and a full memory/CPU safety analysis.

The change touches exactly one value in a pool-size declaration. The +32 workers are I/O-bound (GCS/network), so they scale with I/O latency rather than CPU, and the PR's own memory analysis shows RSS impact of ~1–2 MB per instance — well within the reported 87% headroom. No logic, no interfaces, and no shutdown paths were altered.

No files require special attention. backend/utils/executors.py is the only changed file and the modification is a single constant.

Important Files Changed

| Filename | Overview |
|----------|----------|
| backend/utils/executors.py | Single-line bump of storage_executor max_workers from 96 to 128; no logic changes, no new APIs, no behaviour change beyond pool capacity. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph PerInstance["Per Cloud Run Instance (2 vCPU / 8 GB)"]
        direction TB
        CE["critical_executor\nmax_workers=8"]
        DE["db_executor\nmax_workers=24"]
        LE["llm_executor\nmax_workers=6"]
        SE["stripe_executor\nmax_workers=4"]
        SYE["sync_executor\nmax_workers=16"]
        PE["postprocess_executor\nmax_workers=24"]
        STOE["storage_executor\nmax_workers=128\n(was 96 — changed)"]
    end

    REQ["Incoming Request"] --> CE & DE & LE & SE & SYE & PE & STOE

    STOE -->|"GCS / network I/O\nGIL released"| GCS["Google Cloud Storage"]
    STOE -->|"PCM merge / WAV encode\nbounded by 2 vCPU"| AUDIO["Audio processing"]

    STOE -.->|"queue_depth p95=30\nmax=43"| QUEUE["Work queue\n(was pegged at 100% util)"]

    style STOE fill:#ffd700,stroke:#333,stroke-width:2px

Reviews (1): Last reviewed commit: "backend: raise storage_executor pool 96 ..."

@mdmohsin7 mdmohsin7 merged commit 22d3b44 into main May 28, 2026
2 checks passed
@mdmohsin7 mdmohsin7 deleted the caleb/storage-pool-128 branch May 28, 2026 20:05