
Reduce private cloud sync costs: Opus encoding + batch GCS uploads + Filestore spool #5418

@beastoin

Description


Problem

Private cloud audio sync uploads raw PCM16 chunks (80KB each) directly to GCS every 5 seconds per active session. At current scale (200 concurrent sessions), this generates 3.5M GCS Class A write ops/day (~$518/month). At 100x scale this becomes $51,840/month in ops alone, and the approach doesn't scale further.
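For reference, the arithmetic behind those figures, assuming the standard ~$0.005 per 1,000 Class A operations (verify against current GCS pricing):

```python
# Back-of-the-envelope check of the quoted numbers.
sessions = 200
writes_per_day = sessions * 86_400 // 5            # one write per 5 s per session
monthly_cost = writes_per_day * 30 * 0.005 / 1_000 # ~$0.005 per 1,000 Class A ops
print(writes_per_day, round(monthly_cost))         # 3456000 518  -> ~3.5M ops/day, ~$518/mo
```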

Root Causes

  1. No compression: Raw PCM16 at 16KB/s stored as-is (or encrypted). Opus encoding would give ~10x reduction in storage and bandwidth.
  2. Per-chunk GCS writes: Every 5-second chunk = 1 GCS Class A write op. No batching.
  3. In-memory only queue: Audio chunks are buffered in memory. Pod crash = data loss.

Proposed Solution (Phased)

Phase 1: Opus Encoding (Immediate Win)

  • Add opuslib.Encoder(sample_rate, 1, APPLICATION_VOIP) in the pusher (see the sketch after this list)
  • Encode before encryption: PCM → Opus → Encrypt
  • ~10x storage reduction: 80KB → ~8KB per 5-second chunk
  • CPU cost: ~0.3-0.8% vCPU per session (32.5ms per 5s chunk)
  • New extensions: .opus (standard) / .opus.enc (enhanced)
  • Update storage.py list/delete/download to support new extensions
  • Update merge_conversations.py extension handling
  • Update conversations.py duration math (don't assume fixed +5s)
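A minimal sketch of the Phase 1 encode path, assuming opuslib with 20 ms frames and a simple length-prefixed packet framing; the frame size and framing format here are illustrative choices, not the project's actual on-disk layout:

```python
# Hedged sketch: encode one 5-second PCM16 chunk to Opus packets before encryption.
import struct
import opuslib

SAMPLE_RATE = 16000                              # PCM16 mono at 16 kHz (16 KB/s)
FRAME_MS = 20                                    # Opus frame duration
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples per frame
FRAME_BYTES = FRAME_SAMPLES * 2                  # 640 bytes of PCM16 per frame

encoder = opuslib.Encoder(SAMPLE_RATE, 1, opuslib.APPLICATION_VOIP)

def encode_chunk(pcm: bytes) -> bytes:
    """Encode a raw PCM16 chunk into length-prefixed Opus packets (trailing partial frame dropped)."""
    out = bytearray()
    for off in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        packet = encoder.encode(pcm[off:off + FRAME_BYTES], FRAME_SAMPLES)
        out += struct.pack(">H", len(packet)) + packet   # 2-byte length prefix per packet
    return bytes(out)
```

Encryption would then run over the returned bytes, so the existing encrypt API stays unchanged, only its input shrinks ~10x.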

Phase 2: 60-Second Batch Upload

  • Accumulate 60s of chunks before a single GCS upload (12x fewer write ops); see the batching sketch after this list
  • Flush triggers: max_age=60s, max_bytes, conversation end, pod shutdown
  • Run batch sync in dedicated worker deployment (not on pusher hot path)
  • GCS upload idempotency via deterministic object names
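A rough sketch of the accumulate-and-flush logic, assuming the google-cloud-storage client; the ChunkBatcher class, its field names, and the object-naming scheme are hypothetical:

```python
# Hedged sketch of a 60-second batch accumulator with size/age flush triggers.
import time
from google.cloud import storage

class ChunkBatcher:
    def __init__(self, bucket: str, session_id: str,
                 max_age_s: float = 60.0, max_bytes: int = 1 << 20):
        self.bucket = storage.Client().bucket(bucket)
        self.session_id = session_id
        self.max_age_s = max_age_s
        self.max_bytes = max_bytes
        self.buf = bytearray()
        self.started_at = None
        self.batch_index = 0

    def add(self, chunk: bytes):
        if self.started_at is None:
            self.started_at = time.monotonic()
        self.buf += chunk
        if (len(self.buf) >= self.max_bytes
                or time.monotonic() - self.started_at >= self.max_age_s):
            self.flush()

    def flush(self):
        if not self.buf:
            return
        # Deterministic object name: a retried upload overwrites instead of duplicating.
        name = f"{self.session_id}/batch_{self.batch_index:06d}.opus.enc"
        self.bucket.blob(name).upload_from_string(bytes(self.buf))
        self.buf.clear()
        self.started_at = None
        self.batch_index += 1
```

Conversation end and pod shutdown would call flush() explicitly; the deterministic batch name is what makes retried uploads idempotent.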

Phase 3: Filestore Spool (If Durability SLO Requires)

  • Write chunks to Filestore NFS mount instead of memory
  • State machine: .open → .ready → .uploading → .done (see the sketch after this list)
  • Atomic rename + lease timeout for crash recovery
  • Filestore tier selection:
    • 1x: Basic HDD 100GiB on GKE (~$49/mo)
    • 100x: Basic SSD or Zonal (~$768/mo)
    • 10,000x: Must shard across multiple instances
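A sketch of the spool-file state machine using atomic renames on the NFS mount; the paths, suffixes, and lease timeout are illustrative assumptions:

```python
# Hedged sketch of the Filestore spool state machine and crash recovery.
import os
import time

SPOOL_DIR = "/mnt/filestore/spool"   # hypothetical NFS mount point
LEASE_TIMEOUT_S = 300                # reclaim .uploading files older than this

def close_segment(path_open: str) -> str:
    """Atomically promote an .open segment to .ready once writing is complete."""
    path_ready = path_open.replace(".open", ".ready")
    os.rename(path_open, path_ready)   # rename is atomic within one filesystem
    return path_ready

def claim_for_upload(path_ready: str) -> str:
    """Claim a .ready segment; the rename acts as a lease held by a single worker."""
    path_uploading = path_ready.replace(".ready", ".uploading")
    os.rename(path_ready, path_uploading)
    return path_uploading

def recover_stale_uploads():
    """After a crash, requeue .uploading segments whose lease has expired."""
    now = time.time()
    for name in os.listdir(SPOOL_DIR):
        if name.endswith(".uploading"):
            path = os.path.join(SPOOL_DIR, name)
            if now - os.path.getmtime(path) > LEASE_TIMEOUT_S:
                os.rename(path, path.replace(".uploading", ".ready"))
```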

Cost Impact

| Scale | Current (5s writes) | Phase 2 (60s batches) | Phase 3 (+ Filestore) |
|---|---|---|---|
| 1x (200 sessions) | $518/mo | $43/mo | ~$92/mo |
| 100x (20K sessions) | $51,840/mo | $4,320/mo | ~$5,088/mo |
| 10,000x (2M sessions) | $5,184,000/mo | $432,000/mo | + sharding cost |

Opus encoding reduces storage/bandwidth ~10x but does NOT reduce Class A ops (same number of objects unless batched).

Compatibility Requirements

  • Must handle mixed legacy (.bin/.enc) and new (.opus/.opus.enc) chunks; see the decoding sketch after this list
  • Backward compatible: existing recordings remain readable
  • download_audio_chunks_and_merge() must decode Opus back to PCM for playback
  • Feature-flagged rollout per phase
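A sketch of what mixed-format retrieval could look like inside download_audio_chunks_and_merge(); the decode_chunk helper and the length-prefixed Opus framing mirror the Phase 1 sketch above and are assumptions, not the existing storage API:

```python
# Hedged sketch: return raw PCM16 regardless of which chunk format was stored.
import struct
import opuslib

def decode_chunk(name: str, data: bytes, decrypt) -> bytes:
    if name.endswith(".enc"):                     # covers legacy .enc and new .opus.enc
        data = decrypt(data)
    if ".opus" in name:                           # new Opus formats -> decode back to PCM16
        decoder = opuslib.Decoder(16000, 1)
        pcm = bytearray()
        off = 0
        while off < len(data):
            (length,) = struct.unpack_from(">H", data, off)
            off += 2
            pcm += decoder.decode(data[off:off + length], 320)
            off += length
        return bytes(pcm)
    return data                                   # legacy .bin is already raw PCM16
```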

Files to Modify

  • backend/routers/pusher.py — Opus encoder init, batch accumulation
  • backend/utils/other/storage.py — new extensions, batch upload, Opus-aware list/download
  • backend/utils/encryption.py — encrypt Opus bytes (same API, different input)
  • backend/database/conversations.py — flexible duration math
  • backend/utils/conversations/merge_conversations.py — generic extension handling
  • Helm values — Filestore mount config (Phase 3)

Acceptance Criteria

  • Phase 1: Opus-encoded chunks uploaded to GCS, existing recordings unaffected
  • Phase 2: 60s batch uploads reduce Class A ops by ~12x
  • Phase 3: Filestore spool survives pod restart without data loss
  • All phases: mixed old/new chunk formats handled transparently
  • Tests: unit tests for Opus encode/decode, batch flush triggers, mixed-format retrieval
