Problem
Pusher pods hit the 4.5GB memory limit and get OOMKilled. After the deque fix (#6762), Python object queues are properly bounded, but RSS still grows linearly until OOM.
Root Cause
Live-process profiling via gdb injection into PID 1 confirmed:
- 0 bytearrays and 0 bytes objects leaking: Python properly frees audio buffers via reference counting
- glibc malloc fragmentation is the actual cause:
  - malloc arena: ~580MB allocated from OS
  - Actually in-use: ~270MB
  - ~305MB (53%) is freed but trapped in fragmented arena chunks that glibc can't return to the OS
  - malloc_trim(0) only reclaims ~6MB (the topmost chunk); interior fragments are stuck
  - Arena keeps expanding because new allocations can't reuse the fragmented free space
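The arena figures above can be read from inside a running Python process with ctypes. This is a minimal sketch, not the script used during profiling. It assumes glibc 2.33 or newer (where mallinfo2 is available), and the names MallInfo2, read_mallinfo2, and free_fraction are illustrative.

```python
import ctypes
import ctypes.util


class MallInfo2(ctypes.Structure):
    """Mirror of glibc's struct mallinfo2 (all fields are size_t)."""
    _fields_ = [(name, ctypes.c_size_t) for name in (
        "arena", "ordblks", "smblks", "hblks", "hblkhd",
        "usmblks", "fsmblks", "uordblks", "fordblks", "keepcost",
    )]


def free_fraction(arena_bytes, free_bytes):
    """Share of the malloc arena that is free but still held by the process."""
    return free_bytes / arena_bytes if arena_bytes else 0.0


def read_mallinfo2():
    """Return a MallInfo2 snapshot, or None if the C library lacks mallinfo2."""
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
        fn = libc.mallinfo2
    except (OSError, AttributeError):
        return None
    fn.restype = MallInfo2
    fn.argtypes = []
    return fn()


if __name__ == "__main__":
    info = read_mallinfo2()
    if info is None:
        print("mallinfo2 not available on this platform")
    else:
        # arena: bytes obtained from the OS, uordblks: in use, fordblks: free
        print(f"arena={info.arena} in_use={info.uordblks} free={info.fordblks}")
        print(f"free fraction={free_fraction(info.arena, info.fordblks):.0%}")
```

With the numbers reported above, free_fraction(580, 305) is about 0.53. Note that mallinfo2 describes glibc's allocator only, so it is a diagnostic for the pre-fix state.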
The pattern: each WebSocket connection continuously allocates → extends → copies → frees bytearray audio buffers. glibc's per-thread arenas fragment under this high-churn pattern, and freed memory is never returned to the OS.
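The allocation pattern can be approximated with a small simulation. This is a hypothetical sketch: the chunk size, flush thresholds, and the function name churn are assumptions, not values taken from the pusher code, and the script only reproduces the allocate, extend, copy, free cycle without measuring RSS itself.

```python
import random


def churn(connections=50, rounds=200, chunk=3200, seed=0):
    """Simulate per-connection audio buffering.

    Returns (flushed_bytes, pending_bytes).
    """
    rng = random.Random(seed)
    buffers = [bytearray() for _ in range(connections)]
    # Each connection flushes at a different, changing size (hypothetical values),
    # so freed blocks of many sizes end up interleaved with live ones.
    limits = [rng.randint(4, 40) * chunk for _ in range(connections)]
    flushed = 0
    for _ in range(rounds):
        for i in range(connections):
            # Allocate a new chunk and extend; the bytearray may be reallocated.
            buffers[i].extend(bytes(chunk))
            if len(buffers[i]) >= limits[i]:
                payload = bytes(buffers[i])   # copy out for sending
                flushed += len(payload)
                buffers[i] = bytearray()      # old buffer freed via refcount
                limits[i] = rng.randint(4, 40) * chunk
    pending = sum(len(b) for b in buffers)
    return flushed, pending
```

Running this in a loop under glibc malloc and under jemalloc, while watching VmRSS, would be one way to compare the two allocators outside production.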
Solution
LD_PRELOAD=libjemalloc.so.2, a widely used fix for this allocation pattern.
jemalloc uses thread-local caches and size-class slabs, which largely avoid the fragmentation glibc produces under this workload. Used by Redis (default on Linux since 2.4), Firefox, and many long-running servers with high-churn allocations.
Implementation
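Two lines in the pusher Dockerfile:
# apt-get update first so the package index exists in this layer
RUN apt-get update && apt-get install -y libjemalloc2 && rm -rf /var/lib/apt/lists/*
# Debian/Ubuntu amd64 path; the directory differs on other architectures
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2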
No code changes needed. Drop-in replacement at the allocator level.
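If the LD_PRELOAD path is wrong, the dynamic loader prints a warning and the process continues with glibc malloc, so it may be worth confirming that the library is actually mapped. A minimal sketch, assuming the pusher runs as PID 1 as described above; the helper names are illustrative.

```python
def jemalloc_mapped(maps_text):
    """True if any mapping in /proc/<pid>/maps output refers to jemalloc."""
    return any("jemalloc" in line for line in maps_text.splitlines())


def check_pid(pid=1):
    """Check a live process; needs permission to read /proc/<pid>/maps."""
    with open(f"/proc/{pid}/maps") as f:
        return jemalloc_mapped(f.read())


if __name__ == "__main__":
    print("jemalloc loaded:", check_pid(1))
```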
Expected result
RSS stays flat near actual in-use (~300MB) instead of growing unbounded to 4.5GB.
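One way to make "flat" versus "growing" measurable is to fit a slope to periodic VmRSS samples. The sketch below is illustrative only: the sampling interval and any alert threshold are left to the operator, and the function names are assumptions.

```python
import re


def parse_vmrss_kb(status_text):
    """Extract VmRSS in kB from the text of /proc/<pid>/status, or None."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def growth_kb_per_hour(samples):
    """Least-squares slope of (seconds, rss_kb) samples, scaled to kB/hour."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    return cov / var_t * 3600
```

A slope near zero over 24h would match the expected plateau. For comparison, the pre-fix growth reported under Evidence (about 60MB in 10 minutes) corresponds to roughly 360MB per hour.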
Verification
After deploy, monitor pod RSS over 24h — should plateau instead of linear growth. Check with:
kubectl exec <pod> -- python3 -c "
# Read PID 1 (the pusher); /proc/self would report this helper python3 process instead
with open('/proc/1/status') as f:
    for l in f:
        if 'VmRSS' in l: print(l.strip())
"
Evidence
- Profiling method: gdb -batch -p 1 → PyGILState_Ensure → PyRun_SimpleFile (gc.get_objects + mallinfo2)
- Two snapshots 10 min apart: GC-tracked objects grew only ~10MB, but RSS grew ~60MB, with the growth all in malloc arena expansion
- mallinfo2() confirmed the 53% fragmentation ratio
- All deque queues bounded (maxlen=20), coroutine/task counts stable and proportional to connection count
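The Python-side half of such a snapshot can be approximated as follows. This is a sketch of the general technique, not the script that was injected; gc_snapshot and its return shape are assumptions. Note that gc.get_objects() only returns GC-tracked (container) objects, and sys.getsizeof reports shallow sizes.

```python
import gc
import sys
from collections import Counter


def gc_snapshot(top=5):
    """Summarize GC-tracked objects: total count, shallow bytes, largest types."""
    objs = gc.get_objects()
    sizes = Counter()
    for obj in objs:
        # The default of 0 avoids a TypeError for objects without a usable __sizeof__.
        sizes[type(obj).__name__] += sys.getsizeof(obj, 0)
    return {
        "count": len(objs),
        "bytes": sum(sizes.values()),
        "top": sizes.most_common(top),
    }


if __name__ == "__main__":
    snap = gc_snapshot()
    print(snap["count"], "objects,", snap["bytes"], "bytes")
    for name, size in snap["top"]:
        print(f"  {name}: {size}")
```

Comparing the "bytes" value of two snapshots against the RSS delta over the same interval shows how much of the growth is visible to Python at all.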