perf(mq): zero-copy IPC transport with SHM + perf waterfall display by stwilt · Pull Request #82 · PlainsightAI/openfilter

stwilt · 2026-04-19T16:39:01Z

📦 Pull Request

📋 What does this PR do?

Eliminates redundant image copies on the MQ inter-filter transport hot path through four layered optimizations, then adds an opt-in POSIX shared-memory transport that bypasses the kernel socket buffer entirely. Also adds a terminal-based waterfall display for visualizing the IPC performance gains across git refs.

Four perf optimizations (cumulative on 4K native, 2 filters, raw):

Replace bytearray(memoryview(frame.image)) with memoryview.cast('B') in frames2topicmsgs — drops a full pre-send image copy (~17ms → ~10ms/hop)
recv_multipart(copy=False) on the ZMQ SUB socket — receiver gets zmq.Frame objects and np.frombuffer builds zero-copy views (~10ms → ~7.5ms/hop)
send_multipart(copy=False) on the ZMQ PUB socket — sender hands the buffer to ZMQ without an internal copy (~7.5ms → ~6.5ms/hop)
Opt-in SHM transport (OPENFILTER_SHM_ENABLE=1) — sender writes image into a POSIX shm ring slot and sends only a tiny reference over ZMQ; receiver attaches once and builds a zero-copy numpy view (~6.5ms → ~2.9ms/hop)

Plus a terminal waterfall display (make waterfall) that runs the bench harness's pipeline_simulation scenarios at two git refs and renders a rich-styled before/after comparison with per-pipeline gains and per-stage delta annotations. Useful for PR descriptions and demos.

🔍 Why is this needed?

The IPC benchmark harness (PR #77 / FILTER-415) showed that 4K-native raw pipelines spent 63% of their per-frame budget on IPC overhead — the cost of moving 24MB image buffers between filters via ZMQ was larger than the filter processing itself. The four optimizations in this PR reduce that overhead by 83% (35ms → 5.7ms per frame at 4K), lifting max throughput from 18 fps to 39 fps on the same hardware without touching any filter code.

The waterfall display was built to show these gains — it runs the same bench scenarios against any two git refs (e.g., HEAD vs origin/main) and renders a terminal panel showing exactly where the time moved. It's anchored to the bench harness's TestPipelineSimulation so it automatically follows any future runtime composition changes.

🧪 How was it tested?

make test passes (--benchmark-disable, existing test suite)
make bench passes (pipeline_simulation shows reduced IPC overhead)
OPENFILTER_SHM_ENABLE=1 make test passes (SHM path exercised)
make waterfall renders correctly against origin/main
make waterfall.profile produces /tmp/perf-waterfall.svg flamegraph
py-spy profiling confirmed bottleneck attribution at each stage
200-frame bench-mirror reproducer verified gains with realistic access patterns (fresh .ro per iteration + filter sleep)

Measured results (4K native, 2 filters, raw):

Optimization            IPC OH/frame   Per hop    Max FPS
Baseline (main)         ~35 ms         ~17.5 ms   18
+ memoryview send       ~20 ms         ~10 ms     25
+ recv copy=False       ~15 ms         ~7.5 ms    29
+ send copy=False       ~13 ms         ~6.5 ms    30
+ SHM transport         ~5.7 ms        ~2.9 ms    39

🔗 Related Issues

FILTER-419: Investigate shared-memory frame transport (this PR)
FILTER-415: IPC benchmark harness (PR feat(tests): add IPC/serialization overhead benchmarking harness #77, merged — provides the measurement infrastructure this PR optimizes against)
FILTER-367: Inference Performance & Scaling (parent epic)
PLAT-866: OTel hop/sub-hop span instrumentation (follow-up — the same runtime paths instrumented here will get OTel spans)

🖼️ Screenshots or Logs (if applicable)

The make waterfall output (run with --frames 100 --baseline-ref origin/main) shows the per-pipeline gains and a frame waterfall for the 4K case. Best viewed in a terminal with truecolor support; Plainsight brand palette applied to structural chrome, conventional green/red for diagnostic signals.

✅ Checklist

I have read and agreed to the terms of the LICENSE
I have read the CONTRIBUTING guide
I have followed the coding style
I have signed all commits in compliance with the DCO
I have added or updated tests as needed
I have added or updated documentation as needed

…FILTER-419) # 📦 Pull Request ## 📋 What does this PR do? Eliminates redundant image copies on the MQ inter-filter transport hot path through four layered optimizations, then adds an opt-in POSIX shared-memory transport that bypasses the kernel socket buffer entirely. Also adds a terminal-based waterfall display for visualizing the IPC performance gains across git refs. Four perf optimizations (cumulative on 4K native, 2 filters, raw): 1. Replace bytearray(memoryview(frame.image)) with memoryview.cast('B') in frames2topicmsgs — drops a full pre-send image copy (~17ms → ~10ms/hop) 2. recv_multipart(copy=False) on the ZMQ SUB socket — receiver gets zmq.Frame objects and np.frombuffer builds zero-copy views (~10ms → ~7.5ms/hop) 3. send_multipart(copy=False) on the ZMQ PUB socket — sender hands the buffer to ZMQ without an internal copy (~7.5ms → ~6.5ms/hop) 4. Opt-in SHM transport (OPENFILTER_SHM_ENABLE=1) — sender writes image into a POSIX shm ring slot and sends only a tiny reference over ZMQ; receiver attaches once and builds a zero-copy numpy view (~6.5ms → ~2.9ms/hop) Plus a terminal waterfall display (make waterfall) that runs the bench harness's pipeline_simulation scenarios at two git refs and renders a rich-styled before/after comparison with per-pipeline gains and per-stage delta annotations. Useful for PR descriptions and demos. ## 🔍 Why is this needed? The IPC benchmark harness (PR #77 / FILTER-415) showed that 4K-native raw pipelines spent 63% of their per-frame budget on IPC overhead — the cost of moving 24MB image buffers between filters via ZMQ was larger than the filter processing itself. The four optimizations in this PR reduce that overhead by 83% (35ms → 5.7ms per frame at 4K), lifting max throughput from 18 fps to 39 fps on the same hardware without touching any filter code. The waterfall display was built to show these gains — it runs the same bench scenarios against any two git refs (e.g., HEAD vs origin/main) and renders a terminal panel showing exactly where the time moved. It's anchored to the bench harness's TestPipelineSimulation so it automatically follows any future runtime composition changes. ## 🧪 How was it tested? - make test passes (--benchmark-disable, existing test suite) - make bench passes (pipeline_simulation shows reduced IPC overhead) - OPENFILTER_SHM_ENABLE=1 make test passes (SHM path exercised) - make waterfall renders correctly against origin/main - make waterfall.profile produces /tmp/perf-waterfall.svg flamegraph - py-spy profiling confirmed bottleneck attribution at each stage - 200-frame bench-mirror reproducer verified gains with realistic access patterns (fresh .ro per iteration + filter sleep) Measured results (4K native, 2 filters, raw): Optimization IPC OH/frame Per hop Max FPS Baseline (main) ~35 ms ~17.5 ms 18 + memoryview send ~20 ms ~10 ms 25 + recv copy=False ~15 ms ~7.5 ms 29 + send copy=False ~13 ms ~6.5 ms 30 + SHM transport ~5.7 ms ~2.9 ms 39 ## 🔗 Related Issues - FILTER-419: Investigate shared-memory frame transport (this PR) - FILTER-415: IPC benchmark harness (PR #77, merged — provides the measurement infrastructure this PR optimizes against) - FILTER-367: Inference Performance & Scaling (parent epic) - PLAT-866: OTel hop/sub-hop span instrumentation (follow-up — the same runtime paths instrumented here will get OTel spans) ## 🖼️ Screenshots or Logs (if applicable) The make waterfall output (run with --frames 100 --baseline-ref origin/main) shows the per-pipeline gains and a frame waterfall for the 4K case. Best viewed in a terminal with truecolor support; Plainsight brand palette applied to structural chrome, conventional green/red for diagnostic signals. ## ✅ Checklist - [x] I have read and agreed to the terms of the LICENSE - [x] I have read the CONTRIBUTING guide - [x] I have followed the coding style - [x] I have signed all commits in compliance with the DCO - [x] I have added or updated tests as needed - [x] I have added or updated documentation as needed Signed-off-by: stwilt <swilt@plainsight.ai>

…t bugs Three bundled fixes for CI failures on this branch: 1. mq.py: After np.frombuffer in topicmsgs2frames (both the zero-copy ZMQ path and the SHM path), set image.flags.writeable = False. The previous bytearray-based path produced read-only arrays implicitly; zero-copy frombuffer views are writeable by default. Three consumers depend on received frames being immutable: - SHM slots are recycled round-robin by the sender, so any downstream mutation would corrupt live buffers. - Frame.copy() skips the deep copy only when image is read-only. - Frame.jpg caches the encoded blob only when image is read-only. 2. test_zeromq.py: ZMQReceiver.recv_multipart(copy=False) makes payload parts arrive as zmq.Frame wrappers rather than bytes. Teach the recv/recvl test helpers (and the OOB callback lambdas) to materialize zmq.Frame -> bytes for comparison. These tests exercise ZMQ routing, not buffer type, so coercion is the right level to fix at. 3. test_filter.py: test_filter_context_read_models_toml{,_invalid_format} wrote a TOML to /tmp and then tried to rename() it into cwd as models.toml. On hosts where /tmp is a separate filesystem this fails with EXDEV. Drop tempfile; write directly via Path.write_text inside a small context-manager helper that also handles pre-existing backup. Signed-off-by: stwilt <swilt@plainsight.ai>

lucasmundim

Strong performance PR with impressive measured gains (83% IPC overhead reduction at 4K). The layered approach is clean — each optimization is independently valuable and they compose well. CI green across Python 3.10-3.13.

What looks good:

Four independent optimizations that compose cleanly (memoryview send, recv copy=False, send copy=False, SHM)
SHM is fully opt-in (OPENFILTER_SHM_ENABLE=1) — zero behavioral change for existing deployments
image.flags.writeable = False on received frames — prevents downstream mutation of SHM/zero-copy buffers
Stale SHM segment cleanup on startup (sweeps prior crashed runs)
SHMAttachCache lazily attaches — no upfront allocation on the receiver side
Pipeline scenarios lifted to module scope for tooling reuse — DRY between bench and measurement scripts
Per-hop IPC timing added to the bench harness (hop_times_ms) — good observability
Waterfall display uses PEP 723 inline deps — no pyproject changes needed
Test fixture cleanup in test_filter.py (_with_models_toml context manager) is a nice improvement
test_zeromq.py properly handles zmq.Frame objects from copy=False

Five observations inline.

…iant Addresses two concerns raised in review on PR #82: - SHMPool.put() now raises ValueError with a clear message when the frame payload exceeds slot_size, rather than silently corrupting or producing an opaque error. The message names the overflow size and points to OPENFILTER_SHM_SLOT_SIZE. - Added an invariant comment on SHMPool explaining that the round-robin slot reuse is only safe because the current ZMQ send->recv pipeline loop is synchronous (sender blocks until receiver consumes). Async/pipelined dispatch would require a generation counter or semaphore. Signed-off-by: stwilt <swilt@plainsight.ai>

Follow-up to d310e28 (read-only invariant on received frames). Adds a Breaking Changes note to the current release section explaining that filters mutating frame.image in place will now raise "ValueError: assignment destination is read-only", with the migration path (copy first, or build a new Frame). Signed-off-by: stwilt <swilt@plainsight.ai>

lucasmundim

All five observations addressed cleanly: bounds check on SHM slot overflow, synchrony invariant documented, breaking change for read-only frames in RELEASE.md. LGTM.

…R-461) ## 📋 What does this PR do? Adds a `wait_for_runner_exit(runner, timeout=5.0)` helper in `tests/test_filter.py` that polls `runner.step()` until it returns something other than `False` (or times out), and replaces five fragile single-step shutdown assertions across four `TestFilterOld` tests: - `test_topo_balance_step` - `test_topo_ephemeral_join_step` - `test_topo_doubly_ephemeral_tee_step` (two callsites) - `test_topo_subscribe_wildcard_all_step` Each callsite previously asserted that a single `runner.step()` immediately after `qout.put(None)` would return the exit-codes list (`[0] * N` for N filters). With the helper, the assertion now waits up to 5 s for the runner to actually report termination — preserving the original assertion semantics, but with the same `step → settle → assert` discipline the data-frame iterations already use. ## 🔍 Why is this needed? `test_topo_balance_step` started failing on Python 3.10 only as of [Create Release dry-run run #25936326919](https://github.com/PlainsightAI/openfilter/actions/runs/25936326919), while passing on 3.11 / 3.12 / 3.13. The same test passed on 3.10 for the v0.1.30 release ([run #24800470941](https://github.com/PlainsightAI/openfilter/actions/runs/24800470941), 2026-04-22). Between those, #82 introduced the zero-copy IPC transport with SHM — a perf rewrite of the layer this test exercises. The faster IPC shifted timing margins such that on 3.10's GIL/scheduling, at least one of N filters routinely hasn't yet drained the `None` shutdown sentinel + reported its exit code by the time the next `step()` runs. Result: `step()` returns `False` ("not done") instead of `[0, 0, …]`. This is **test fragility, not a runtime regression** — the 9 data-frame iterations in the same test exercise the full pipeline correctly across 3.10 too. But it was also a **release blocker**: the Create Release workflow gates on every matrix Python version passing run-tests, so any release dispatch was rolling dice on the 3.10 job. See [FILTER-461](https://plainsight-ai.atlassian.net/browse/FILTER-461) for the full root-cause analysis. ## 🧪 How was it tested? Set up a fresh Python 3.10 venv with the project in editable mode (`uv venv .venv-310 --python 3.10`, `uv pip install -e ".[all]"`, install pytest), and ran the failing test: ``` $ .venv-310/bin/pytest tests/test_filter.py::TestFilterOld::test_topo_balance_step -v … tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED [100%] 1 passed in 3.66s ``` Repeated 3× consecutively — all green. Also ran all four patched tests together — all pass: ``` tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED [ 25%] tests/test_filter.py::TestFilterOld::test_topo_ephemeral_join_step PASSED [ 50%] tests/test_filter.py::TestFilterOld::test_topo_doubly_ephemeral_tee_step PASSED [ 75%] tests/test_filter.py::TestFilterOld::test_topo_subscribe_wildcard_all_step PASSED [100%] 4 passed in 8.72s ``` Final CI confirmation will come from the Create Release re-dispatch after this merges — all four Python matrix versions should pass run-tests consistently. ## 🔗 Related Issues - [FILTER-461](https://plainsight-ai.atlassian.net/browse/FILTER-461) — root-cause analysis and fix proposal (this PR implements the proposed fix). - Unblocks the v0.2.1 release ceremony (validation of the `PLAINSIGHT_PYPI_TOKEN` publish path was caught by this flake before reaching the PyPI upload step). ## ✅ Checklist - [x] I have read and agreed to the terms of the [LICENSE](../LICENSE) - [x] I have read the [CONTRIBUTING](../CONTRIBUTING.md) guide - [x] I have followed the [coding style](../CONTRIBUTING.md#coding-style) - [x] I have signed all commits in compliance with the DCO (`git commit -s`) - [x] I have added or updated **tests** as needed (the fix IS the test change; helper is documented in-line) - [x] I have added or updated **documentation** as needed (helper has a docstring explaining the FILTER-461 context) Signed-off-by: stwilt <swilt@plainsight.ai>

…R-461) (#94) ## 📋 What does this PR do? Adds a `wait_for_runner_exit(runner, timeout=5.0)` helper in `tests/test_filter.py` that polls `runner.step()` until it returns something other than `False` (or times out), and replaces five fragile single-step shutdown assertions across four `TestFilterOld` tests: - `test_topo_balance_step` - `test_topo_ephemeral_join_step` - `test_topo_doubly_ephemeral_tee_step` (two callsites) - `test_topo_subscribe_wildcard_all_step` Each callsite previously asserted that a single `runner.step()` immediately after `qout.put(None)` would return the exit-codes list (`[0] * N` for N filters). With the helper, the assertion now waits up to 5 s for the runner to actually report termination — preserving the original assertion semantics, but with the same `step → settle → assert` discipline the data-frame iterations already use. ## 🔍 Why is this needed? `test_topo_balance_step` started failing on Python 3.10 only as of [Create Release dry-run run #25936326919](https://github.com/PlainsightAI/openfilter/actions/runs/25936326919), while passing on 3.11 / 3.12 / 3.13. The same test passed on 3.10 for the v0.1.30 release ([run #24800470941](https://github.com/PlainsightAI/openfilter/actions/runs/24800470941), 2026-04-22). Between those, #82 introduced the zero-copy IPC transport with SHM — a perf rewrite of the layer this test exercises. The faster IPC shifted timing margins such that on 3.10's GIL/scheduling, at least one of N filters routinely hasn't yet drained the `None` shutdown sentinel + reported its exit code by the time the next `step()` runs. Result: `step()` returns `False` ("not done") instead of `[0, 0, …]`. This is **test fragility, not a runtime regression** — the 9 data-frame iterations in the same test exercise the full pipeline correctly across 3.10 too. But it was also a **release blocker**: the Create Release workflow gates on every matrix Python version passing run-tests, so any release dispatch was rolling dice on the 3.10 job. See [FILTER-461](https://plainsight-ai.atlassian.net/browse/FILTER-461) for the full root-cause analysis. ## 🧪 How was it tested? Set up a fresh Python 3.10 venv with the project in editable mode (`uv venv .venv-310 --python 3.10`, `uv pip install -e ".[all]"`, install pytest), and ran the failing test: ``` $ .venv-310/bin/pytest tests/test_filter.py::TestFilterOld::test_topo_balance_step -v … tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED [100%] 1 passed in 3.66s ``` Repeated 3× consecutively — all green. Also ran all four patched tests together — all pass: ``` tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED [ 25%] tests/test_filter.py::TestFilterOld::test_topo_ephemeral_join_step PASSED [ 50%] tests/test_filter.py::TestFilterOld::test_topo_doubly_ephemeral_tee_step PASSED [ 75%] tests/test_filter.py::TestFilterOld::test_topo_subscribe_wildcard_all_step PASSED [100%] 4 passed in 8.72s ``` Final CI confirmation will come from the Create Release re-dispatch after this merges — all four Python matrix versions should pass run-tests consistently. ## 🔗 Related Issues - [FILTER-461](https://plainsight-ai.atlassian.net/browse/FILTER-461) — root-cause analysis and fix proposal (this PR implements the proposed fix). - Unblocks the v0.2.1 release ceremony (validation of the `PLAINSIGHT_PYPI_TOKEN` publish path was caught by this flake before reaching the PyPI upload step). ## ✅ Checklist - [x] I have read and agreed to the terms of the [LICENSE](../LICENSE) - [x] I have read the [CONTRIBUTING](../CONTRIBUTING.md) guide - [x] I have followed the [coding style](../CONTRIBUTING.md#coding-style) - [x] I have signed all commits in compliance with the DCO (`git commit -s`) - [x] I have added or updated **tests** as needed (the fix IS the test change; helper is documented in-line) - [x] I have added or updated **documentation** as needed (helper has a docstring explaining the FILTER-461 context) [FILTER-461]: https://plainsight-ai.atlassian.net/browse/FILTER-461?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Signed-off-by: stwilt <swilt@plainsight.ai>

stwilt requested review from lucasmundim and shingonoide April 19, 2026 16:39

shingonoide assigned stwilt Apr 19, 2026

lucasmundim reviewed Apr 19, 2026

View reviewed changes

Comment thread openfilter/filter_runtime/shm_transport.py

Comment thread openfilter/filter_runtime/shm_transport.py

Comment thread openfilter/filter_runtime/mq.py

Comment thread openfilter/filter_runtime/mq.py

Comment thread openfilter/filter_runtime/mq.py

stwilt added 2 commits April 19, 2026 20:14

stwilt requested a review from lucasmundim April 20, 2026 01:23

lucasmundim approved these changes Apr 20, 2026

View reviewed changes

stwilt merged commit 72586c2 into main Apr 20, 2026
10 checks passed

stwilt deleted the feat/filter-419-ipc-perf branch April 20, 2026 05:17

stwilt mentioned this pull request May 15, 2026

fix(tests): poll for runner exit instead of single-step assert (FILTER-461) #94

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(mq): zero-copy IPC transport with SHM + perf waterfall display#82

perf(mq): zero-copy IPC transport with SHM + perf waterfall display#82
stwilt merged 4 commits into
mainfrom
feat/filter-419-ipc-perf

stwilt commented Apr 19, 2026

Uh oh!

lucasmundim left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lucasmundim left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

stwilt commented Apr 19, 2026

📦 Pull Request

📋 What does this PR do?

🔍 Why is this needed?

🧪 How was it tested?

🔗 Related Issues

🖼️ Screenshots or Logs (if applicable)

✅ Checklist

Uh oh!

lucasmundim left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lucasmundim left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants