Add adaptive backpressure and bound librdkafka internal memory #32

Merged: jghoman merged 2 commits into main from adaptive-batch-size on Apr 2, 2026

@jghoman jghoman commented Apr 2, 2026

Summary

Two complementary mechanisms to prevent OOM during catchup and traffic spikes:

Adaptive batch sizing (backpressure.py)

Consume batch size scales proportionally with buffer fullness:

fullness = pending_bytes / flush_size
batch_size = max(10, int(CONSUME_BATCH_SIZE * (1.0 - fullness)))

Buffer empty → consume at full speed. Buffer approaching flush threshold → consume in tiny batches. No state machine, no mode switching. Handles catchup, steady state, and bursts with one formula.

This is a throughput-smoothing mechanism — it controls how fast we dequeue from librdkafka's internal buffer.
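As a rough illustration of the formula above, here is a minimal sketch of a pure function that computes the batch size. The function name, its parameters, and the `MIN_BATCH_SIZE` constant are assumptions for this example and may not match what `backpressure.py` actually defines. The `int()` truncation follows the follow-up commit note about fixing the README formula.

```python
MIN_BATCH_SIZE = 10  # floor from the formula above; the constant name is an assumption


def adaptive_batch_size(pending_bytes: int, flush_size: int, max_batch_size: int) -> int:
    """Scale the consume batch size down linearly as the buffer fills.

    Hypothetical helper: max_batch_size plays the role of CONSUME_BATCH_SIZE.
    """
    if flush_size <= 0:
        raise ValueError("flush_size must be positive")
    # fullness can exceed 1.0 if the buffer overshoots the flush threshold;
    # the max() floor below keeps the result at MIN_BATCH_SIZE in that case.
    fullness = pending_bytes / flush_size
    return max(MIN_BATCH_SIZE, int(max_batch_size * (1.0 - fullness)))
```

With `max_batch_size=1000` and `flush_size=100`, an empty buffer yields 1000, a half-full buffer yields 500, and a buffer at or beyond the threshold yields the floor of 10.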

librdkafka memory bound (consumer.py)

Sets queued.max.messages.kbytes=16384 (16MB per partition). This is the OOM prevention mechanism. Without it, librdkafka pre-fetches up to 64MB per partition regardless of consume rate. With 8 partitions per consumer (prod at 64 replicas), that is 8 × 64MB = 512MB of uncontrolled internal buffering. The cap brings this down to 8 × 16MB = 128MB.
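The memory arithmetic and the config default can be sketched as follows. This is an illustration only: `build_consumer_config` and `worst_case_queue_mb` are hypothetical helpers, not functions from `consumer.py`. The use of `setdefault` mirrors the follow-up commit note that an explicit override should win, and the per-partition multiplication follows the accounting used in this description.

```python
PARTITIONS = 8             # partitions per consumer, as quoted in the description
UNBOUNDED_QUEUE_KB = 65536  # 64 MB, the pre-fetch figure quoted in the description
BOUNDED_QUEUE_KB = 16384    # 16 MB, the new cap


def build_consumer_config(overrides: dict) -> dict:
    """Return a consumer config dict, applying the queue cap only if not already set."""
    config = dict(overrides)
    # setdefault keeps any caller-supplied value and fills in the cap otherwise
    config.setdefault("queued.max.messages.kbytes", BOUNDED_QUEUE_KB)
    return config


def worst_case_queue_mb(queue_kb: int, partitions: int) -> int:
    """Worst-case internal buffering in MB, assuming the limit applies per partition."""
    return queue_kb * partitions // 1024


before_mb = worst_case_queue_mb(UNBOUNDED_QUEUE_KB, PARTITIONS)  # 512
after_mb = worst_case_queue_mb(BOUNDED_QUEUE_KB, PARTITIONS)     # 128
```

The resulting dict has the shape that a librdkafka-based client accepts as its configuration; the other keys a real consumer needs (brokers, group id, and so on) are omitted here.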

New metrics

  • millpond_buffer_fullness: ratio of pending bytes to flush size (0.0 = empty, 1.0 = flush threshold; values above 1.0 are possible)
  • millpond_consume_batch_size_current — current adaptive batch size

Test plan

  • 150 unit tests pass (14 new for backpressure, 1 new for queued.max.messages.kbytes)
  • Lint + format clean
  • Deploy to dev, verify buffer_fullness metric in Grafana
  • Verify no OOM during catchup with reduced flush sizes

Proportional batch sizing: consume batch size scales linearly from
CONSUME_BATCH_SIZE (buffer empty) to 10 (buffer at flush threshold).
Smooths throughput during catchup and traffic spikes.

Bound librdkafka memory: queued.max.messages.kbytes=16384 (16MB per
partition) prevents librdkafka from pre-fetching unbounded data.
This is the actual OOM prevention; batch sizing is throughput smoothing.

New metrics: millpond_buffer_fullness, millpond_consume_batch_size_current.

Copilot AI left a comment

Pull request overview

Adds adaptive consume backpressure and caps librdkafka’s internal consumer queue to reduce OOM risk during catchup and traffic spikes.

Changes:

  • Introduces proportional batch sizing based on pending buffer fullness (millpond/backpressure.py) and wires it into the main consume loop.
  • Adds a librdkafka queue memory bound via queued.max.messages.kbytes and tests for it.
  • Adds Prometheus gauges for buffer fullness and current adaptive batch size, plus documentation updates.
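To make the wiring into the consume loop concrete, here is a simplified sketch of how the batch size might be recomputed on every iteration. It is not the code in `millpond/main.py`: `FakeConsumer` and `run_loop` are hypothetical stand-ins, and the loop exits on an empty batch only so that the example terminates. The stub's `consume(num_messages, timeout)` shape is modelled on the confluent-kafka consumer call.

```python
MIN_BATCH_SIZE = 10  # floor from the PR formula; the constant name is an assumption


class FakeConsumer:
    """Hypothetical stand-in for a Kafka consumer that returns fixed-size payloads."""

    def __init__(self, total_messages: int, payload_size: int):
        self.remaining = total_messages
        self.payload_size = payload_size

    def consume(self, num_messages: int, timeout: float = 1.0) -> list:
        n = min(num_messages, self.remaining)
        self.remaining -= n
        return [b"x" * self.payload_size for _ in range(n)]


def run_loop(consumer, flush_size: int, max_batch_size: int):
    """Consume until the stub is drained; return the batch sizes used and the flush count."""
    pending_bytes = 0
    batch_sizes = []
    flushes = 0
    while True:
        # Recompute the batch size from current buffer fullness on every pass.
        fullness = pending_bytes / flush_size
        batch_size = max(MIN_BATCH_SIZE, int(max_batch_size * (1.0 - fullness)))
        batch_sizes.append(batch_size)
        messages = consumer.consume(num_messages=batch_size, timeout=1.0)
        if not messages:
            break  # a real loop would keep polling; the sketch stops here
        pending_bytes += sum(len(m) for m in messages)
        if pending_bytes >= flush_size:
            flushes += 1       # stand-in for writing the buffer out
            pending_bytes = 0  # buffer is empty again, so the next batch is full size
    return batch_sizes, flushes
```

Running it with 250 five-byte messages, a 1000-byte flush size, and a maximum batch of 100 shows the batch size shrinking as the buffer fills and snapping back to 100 after the flush.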

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Summary per file:

| File | Description |
| --- | --- |
| millpond/backpressure.py | New adaptive batch sizing logic + metric emission. |
| millpond/main.py | Uses adaptive batch size when calling consumer.consume(). |
| millpond/consumer.py | Sets queued.max.messages.kbytes to bound internal buffering. |
| millpond/metrics.py | Adds gauges for buffer fullness and current batch size. |
| tests/unit/test_backpressure.py | Unit tests for batch sizing + metric updates. |
| tests/unit/test_consumer.py | Unit test asserting consumer config includes queue bound. |
| README.md | Documents adaptive backpressure behavior and metrics. |
| AGENT.md | Adds design notes for adaptive backpressure and metrics. |

Comment thread millpond/backpressure.py Outdated
Comment thread millpond/backpressure.py Outdated
Comment thread millpond/consumer.py
Comment thread README.md Outdated
Comment thread README.md Outdated
Comment thread AGENT.md Outdated
- Remove unused logger from backpressure.py
- Clamp max_batch_size to MIN_BATCH_SIZE in init()
- Use setdefault for queued.max.messages.kbytes (allow env override)
- Fix README formula to include int()
- Align docs: backpressure is throughput smoothing, not OOM prevention
- Fix AGENT.md: buffer_fullness can exceed 1.0
- Merge conflict resolution: include offset resume tests from main
@jghoman jghoman merged commit 33f49e7 into main Apr 2, 2026
15 checks passed
@jghoman jghoman deleted the adaptive-batch-size branch April 2, 2026 19:11

2 participants