backend: raise storage_executor pool 96 → 128 by mdmohsin7 · Pull Request #7529 · BasedHardware/omi

mdmohsin7 · 2026-05-28T20:03:10Z

Summary

Raise storage_executor max_workers from 96 → 128 in backend/utils/executors.py.
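A minimal sketch of what the one-line change might look like, assuming `storage_executor` is a standard `concurrent.futures.ThreadPoolExecutor`. Only the `max_workers` values come from this PR; the `thread_name_prefix` argument is a hypothetical illustration and is not taken from the diff.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed shape of the declaration in backend/utils/executors.py.
# Only max_workers is from this PR; the thread name prefix is hypothetical.
storage_executor = ThreadPoolExecutor(
    max_workers=128,  # was 96
    thread_name_prefix="storage",
)
```

`ThreadPoolExecutor` starts worker threads on demand, so raising the ceiling does not by itself create 32 threads at startup; they appear only when submitted work exceeds the number of idle workers.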

Why

After the previous round of changes (minScale=10, concurrency=6, _PRECACHE_FILE_SEM=2) the storage pool on backend-sync is still pegged at 100% utilization. The queue depth distribution over the past 30 minutes (steady-state on the new revision):

Percentile	Queue depth
p50	19
p95	30
p99	38
max	43

96 + 32 = 128: the 32 additional workers cover the observed p95 queue depth of 30. Sync work that was waiting in this queue should drop to near zero in the common case.
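The sizing argument can be checked with a few lines of arithmetic. This is a rough sketch that assumes the offered load stays the same after the change, so each added worker removes one waiting task; real queue behaviour will not be this linear.

```python
# Queue-depth percentiles reported above (30-minute steady-state window).
queue_depth = {"p50": 19, "p95": 30, "p99": 38, "max": 43}

old_workers = 96
new_workers = 128
added = new_workers - old_workers  # 32 extra workers

# A percentile counts as "absorbed" if the extra workers could pick up
# every task that was waiting at that queue depth.
absorbed = {name: depth <= added for name, depth in queue_depth.items()}
print(absorbed)  # {'p50': True, 'p95': True, 'p99': False, 'max': False}

# Workers needed to absorb the deepest observed queue.
print(old_workers + queue_depth["max"])  # 139
```

Under this simplification, 128 covers p50 and p95 but not p99 or max, which is why 140 is named as the follow-up value.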

Safety analysis (verified, not assumed)

Memory:

Current memory p99 per instance: 13% of 8 GB
Linux lazy-commits pthread stacks — virtual address space grows ~64 MB (32 × 2 MB) but actual RSS impact is ~1–2 MB per instance for idle/I/O-bound threads
Plenty of headroom
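The virtual-address-space figure is plain arithmetic. The sketch below takes the 2 MB per-thread stack reservation from the PR text as given (the actual default depends on the platform and container configuration) and shows the worst case if every reserved page were actually touched.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

extra_threads = 32
stack_reserve = 2 * MIB      # per-thread stack size assumed in the PR text
instance_memory = 8 * GIB

# Virtual address space reserved for the additional thread stacks.
virtual_growth = extra_threads * stack_reserve
print(virtual_growth // MIB)  # 64

# Worst case: every reserved stack page is resident.
worst_case_pct = 100 * virtual_growth / instance_memory
print(round(worst_case_pct, 2))  # 0.78
```

Even in that worst case the added stacks stay below 1% of instance memory, so the conclusion does not depend on the lazy-commit estimate being exact.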

CPU:

Current CPU p99 per instance: 84% (bursty, not sustained)
Storage work is mostly I/O-bound (GCS, network) — Python releases the GIL during blocking syscalls, so additional threads scale with I/O latency, not with CPU
CPU-bound portion (PCM merge / WAV encoding) is bounded by the 2 vCPU ceiling regardless of thread count
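The I/O-bound claim can be illustrated with a small experiment. In this sketch `time.sleep` stands in for a blocking GCS or network call (an assumption made for illustration; it is not the service's real workload), and the function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_blocking_io(delay: float) -> float:
    # time.sleep releases the GIL while waiting, as a blocking
    # network call would.
    time.sleep(delay)
    return delay


def run_batch(workers: int, tasks: int = 32, delay: float = 0.05) -> float:
    """Run `tasks` simulated I/O calls on a pool and return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_blocking_io, [delay] * tasks))
    return time.perf_counter() - start


narrow = run_batch(workers=4)   # 32 tasks run in 8 waves of 4
wide = run_batch(workers=32)    # 32 tasks run in a single wave
print(f"4 workers: {narrow:.2f}s, 32 workers: {wide:.2f}s")
```

The wider pool finishes sooner because the threads spend their time waiting rather than competing for the GIL. CPU-bound work such as PCM merging would not show the same speedup.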

Cloud Run autoscaler: unaffected. Autoscales on CPU + request concurrency; thread count is a per-instance knob.

Test plan

Watch executor_pool_health warnings on backend-sync for 15–30 min post-deploy — storage max_q should drop substantially (target: well under 10 in p95)
Watch memory utilization — should not exceed 25% per instance even at peak
No new 5xx errors on /v2/sync-local-files or /v1/sync/audio/*
If max_q still > 10 consistently after this, follow-up bump to 140 (one-line change)
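The queue-depth criterion in the test plan could be checked mechanically along these lines. This is a sketch only: the helper names are hypothetical, the sample values are made up for illustration, and how `max_q` is actually extracted from the `executor_pool_health` warnings is not shown in this PR.

```python
import math


def nearest_rank_percentile(samples: list[int], pct: float) -> int:
    """Smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def next_step(max_q_samples: list[int], p95_target: int = 10) -> str:
    """Apply the test-plan rule: p95 above target means a follow-up bump."""
    p95 = nearest_rank_percentile(max_q_samples, 95)
    if p95 > p95_target:
        return f"p95 max_q={p95}: consider follow-up bump to 140"
    return f"p95 max_q={p95}: within target"


# Hypothetical post-deploy samples, not real data.
samples = [0, 1, 0, 3, 2, 5, 1, 0, 4, 2]
print(next_step(samples))  # p95 max_q=5: within target
```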

Storage pool was averaging 100% util on backend-sync with queue depth distribution (30-min steady-state, post minScale=10 / concurrency=6 / PRECACHE_FILE_SEM=2): p50=19, p95=30, p99=38, max=43. Adding 32 workers (96 + 32 = 128) absorbs the observed p95 queue depth. Memory headroom is comfortable: per-instance p99 sits at 13% of 8GB, and Linux lazy-commits thread stacks, so the additional 32 threads add ~1–2 MB actual RSS per instance, not the virtual address space figure. CPU p99 bursts to 84%, but most storage work is I/O-bound (GCS calls, network): the GIL is released during blocking syscalls, so the extra threads scale with the I/O latency they hide, not with CPU. The CPU-bound portion (PCM merge / WAV encoding inside audio_merge) remains bounded by the 2 vCPU ceiling regardless. If queue depth still trends above ~10 after this, the next move is a bump to 140 (covers max=43, since 96 + 43 = 139); 192 is overshoot for the observed distribution.

greptile-apps · 2026-05-28T20:05:06Z

Greptile Summary

This PR raises the storage_executor thread-pool ceiling from 96 to 128 in response to sustained 100% utilisation and observed queue depths (p95 ≈ 30, max ≈ 43) on the backend-sync service. The +32 workers are sized to absorb the p95 queue depth with headroom, and the PR includes a thorough memory/CPU safety analysis and a concrete post-deploy watch list.

Only changed line: max_workers=96 → max_workers=128 on storage_executor in backend/utils/executors.py. All other executor sizes, monitoring logic, and shutdown behaviour are untouched.
Total per-instance thread count across the seven pools shown in the flowchart below rises from 178 to 210, which remains well within Linux and Cloud Run limits; storage work is I/O-bound (GCS/network), so the GIL is not a bottleneck for these extra threads.

Confidence Score: 5/5

Safe to merge — the diff is a single integer constant change backed by production queue-depth data and a full memory/CPU safety analysis.

The change touches exactly one value in a pool-size declaration. The +32 workers are I/O-bound (GCS/network), so they scale with I/O latency rather than CPU, and the PR's own memory analysis shows RSS impact of ~1–2 MB per instance — well within the reported 87% headroom. No logic, no interfaces, and no shutdown paths were altered.

No files require special attention. backend/utils/executors.py is the only changed file and the modification is a single constant.

Important Files Changed

Filename	Overview
backend/utils/executors.py	Single-line bump of storage_executor max_workers from 96 to 128; no logic changes, no new APIs, no behaviour change beyond pool capacity.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph PerInstance["Per Cloud Run Instance (2 vCPU / 8 GB)"]
        direction TB
        CE["critical_executor\nmax_workers=8"]
        DE["db_executor\nmax_workers=24"]
        LE["llm_executor\nmax_workers=6"]
        SE["stripe_executor\nmax_workers=4"]
        SYE["sync_executor\nmax_workers=16"]
        PE["postprocess_executor\nmax_workers=24"]
        STOE["storage_executor\nmax_workers=128\n(was 96 — changed)"]
    end

    REQ["Incoming Request"] --> CE & DE & LE & SE & SYE & PE & STOE

    STOE -->|"GCS / network I/O\nGIL released"| GCS["Google Cloud Storage"]
    STOE -->|"PCM merge / WAV encode\nbounded by 2 vCPU"| AUDIO["Audio processing"]

    STOE -.->|"queue_depth p95=30\nmax=43"| QUEUE["Work queue\n(was pegged at 100% util)"]

    style STOE fill:#ffd700,stroke:#333,stroke-width:2px
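The per-instance thread total can be recomputed from the pool sizes in the flowchart above. This sketch counts only the seven pools named there; any threads created outside these executors are not included.

```python
# Pool sizes as listed in the flowchart (post-change values).
pools = {
    "critical_executor": 8,
    "db_executor": 24,
    "llm_executor": 6,
    "stripe_executor": 4,
    "sync_executor": 16,
    "postprocess_executor": 24,
    "storage_executor": 128,
}

total_after = sum(pools.values())
# Only storage_executor changed, from 96 to 128.
total_before = total_after - (128 - 96)
print(total_before, total_after)  # 178 210
```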


mdmohsin7 merged commit 22d3b44 into main May 28, 2026
2 checks passed

mdmohsin7 deleted the caleb/storage-pool-128 branch May 28, 2026 20:05

mdmohsin7 mentioned this pull request May 28, 2026

Sync infra changes — May 28: storage pool, autoscaling, and app-side rate-limit fixes #7531

Open
