perf(mq): zero-copy IPC transport with SHM + perf waterfall display#82
Merged
Conversation
…FILTER-419)
# 📦 Pull Request
## 📋 What does this PR do?
Eliminates redundant image copies on the MQ inter-filter transport
hot path through four layered optimizations, then adds an opt-in
POSIX shared-memory transport that bypasses the kernel socket buffer
entirely. Also adds a terminal-based waterfall display for visualizing
the IPC performance gains across git refs.
Four perf optimizations (cumulative on 4K native, 2 filters, raw):
1. Replace bytearray(memoryview(frame.image)) with memoryview.cast('B')
in frames2topicmsgs — drops a full pre-send image copy (~17ms → ~10ms/hop)
2. recv_multipart(copy=False) on the ZMQ SUB socket — receiver gets
zmq.Frame objects and np.frombuffer builds zero-copy views (~10ms → ~7.5ms/hop)
3. send_multipart(copy=False) on the ZMQ PUB socket — sender hands the
buffer to ZMQ without an internal copy (~7.5ms → ~6.5ms/hop)
4. Opt-in SHM transport (OPENFILTER_SHM_ENABLE=1) — sender writes image
into a POSIX shm ring slot and sends only a tiny reference over ZMQ;
receiver attaches once and builds a zero-copy numpy view (~6.5ms → ~2.9ms/hop)
Plus a terminal waterfall display (make waterfall) that runs the bench
harness's pipeline_simulation scenarios at two git refs and renders a
rich-styled before/after comparison with per-pipeline gains and
per-stage delta annotations. Useful for PR descriptions and demos.
## 🔍 Why is this needed?
The IPC benchmark harness (PR #77 / FILTER-415) showed that 4K-native
raw pipelines spent 63% of their per-frame budget on IPC overhead —
the cost of moving 24MB image buffers between filters via ZMQ was
larger than the filter processing itself. The four optimizations in
this PR reduce that overhead by 83% (35ms → 5.7ms per frame at 4K),
lifting max throughput from 18 fps to 39 fps on the same hardware
without touching any filter code.
The waterfall display was built to show these gains — it runs the same
bench scenarios against any two git refs (e.g., HEAD vs origin/main)
and renders a terminal panel showing exactly where the time moved.
It's anchored to the bench harness's TestPipelineSimulation so it
automatically follows any future runtime composition changes.
## 🧪 How was it tested?
- make test passes (--benchmark-disable, existing test suite)
- make bench passes (pipeline_simulation shows reduced IPC overhead)
- OPENFILTER_SHM_ENABLE=1 make test passes (SHM path exercised)
- make waterfall renders correctly against origin/main
- make waterfall.profile produces /tmp/perf-waterfall.svg flamegraph
- py-spy profiling confirmed bottleneck attribution at each stage
- 200-frame bench-mirror reproducer verified gains with realistic
access patterns (fresh .ro per iteration + filter sleep)
Measured results (4K native, 2 filters, raw):
Optimization IPC OH/frame Per hop Max FPS
Baseline (main) ~35 ms ~17.5 ms 18
+ memoryview send ~20 ms ~10 ms 25
+ recv copy=False ~15 ms ~7.5 ms 29
+ send copy=False ~13 ms ~6.5 ms 30
+ SHM transport ~5.7 ms ~2.9 ms 39
## 🔗 Related Issues
- FILTER-419: Investigate shared-memory frame transport (this PR)
- FILTER-415: IPC benchmark harness (PR #77, merged — provides the
measurement infrastructure this PR optimizes against)
- FILTER-367: Inference Performance & Scaling (parent epic)
- PLAT-866: OTel hop/sub-hop span instrumentation (follow-up — the
same runtime paths instrumented here will get OTel spans)
## 🖼️ Screenshots or Logs (if applicable)
The make waterfall output (run with --frames 100 --baseline-ref origin/main)
shows the per-pipeline gains and a frame waterfall for the 4K case.
Best viewed in a terminal with truecolor support; Plainsight brand
palette applied to structural chrome, conventional green/red for
diagnostic signals.
## ✅ Checklist
- [x] I have read and agreed to the terms of the LICENSE
- [x] I have read the CONTRIBUTING guide
- [x] I have followed the coding style
- [x] I have signed all commits in compliance with the DCO
- [x] I have added or updated tests as needed
- [x] I have added or updated documentation as needed
Signed-off-by: stwilt <swilt@plainsight.ai>
…t bugs
Three bundled fixes for CI failures on this branch:
1. mq.py: After np.frombuffer in topicmsgs2frames (both the zero-copy
ZMQ path and the SHM path), set image.flags.writeable = False. The
previous bytearray-based path produced read-only arrays implicitly;
zero-copy frombuffer views are writeable by default. Three consumers
depend on received frames being immutable:
- SHM slots are recycled round-robin by the sender, so any
downstream mutation would corrupt live buffers.
- Frame.copy() skips the deep copy only when image is read-only.
- Frame.jpg caches the encoded blob only when image is read-only.
2. test_zeromq.py: ZMQReceiver.recv_multipart(copy=False) makes payload
parts arrive as zmq.Frame wrappers rather than bytes. Teach the
recv/recvl test helpers (and the OOB callback lambdas) to materialize
zmq.Frame -> bytes for comparison. These tests exercise ZMQ routing,
not buffer type, so coercion is the right level to fix at.
3. test_filter.py: test_filter_context_read_models_toml{,_invalid_format}
wrote a TOML to /tmp and then tried to rename() it into cwd as
models.toml. On hosts where /tmp is a separate filesystem this fails
with EXDEV. Drop tempfile; write directly via Path.write_text inside
a small context-manager helper that also handles pre-existing backup.
Signed-off-by: stwilt <swilt@plainsight.ai>
lucasmundim
reviewed
Apr 19, 2026
Contributor
lucasmundim
left a comment
There was a problem hiding this comment.
Strong performance PR with impressive measured gains (83% IPC overhead reduction at 4K). The layered approach is clean — each optimization is independently valuable and they compose well. CI green across Python 3.10-3.13.
What looks good:
- Four independent optimizations that compose cleanly (memoryview send, recv copy=False, send copy=False, SHM)
- SHM is fully opt-in (
OPENFILTER_SHM_ENABLE=1) — zero behavioral change for existing deployments image.flags.writeable = Falseon received frames — prevents downstream mutation of SHM/zero-copy buffers- Stale SHM segment cleanup on startup (sweeps prior crashed runs)
SHMAttachCachelazily attaches — no upfront allocation on the receiver side- Pipeline scenarios lifted to module scope for tooling reuse — DRY between bench and measurement scripts
- Per-hop IPC timing added to the bench harness (
hop_times_ms) — good observability - Waterfall display uses PEP 723 inline deps — no pyproject changes needed
- Test fixture cleanup in
test_filter.py(_with_models_tomlcontext manager) is a nice improvement test_zeromq.pyproperly handleszmq.Frameobjects fromcopy=False
Five observations inline.
…iant Addresses two concerns raised in review on PR #82: - SHMPool.put() now raises ValueError with a clear message when the frame payload exceeds slot_size, rather than silently corrupting or producing an opaque error. The message names the overflow size and points to OPENFILTER_SHM_SLOT_SIZE. - Added an invariant comment on SHMPool explaining that the round-robin slot reuse is only safe because the current ZMQ send->recv pipeline loop is synchronous (sender blocks until receiver consumes). Async/pipelined dispatch would require a generation counter or semaphore. Signed-off-by: stwilt <swilt@plainsight.ai>
Follow-up to d310e28 (read-only invariant on received frames). Adds a Breaking Changes note to the current release section explaining that filters mutating frame.image in place will now raise "ValueError: assignment destination is read-only", with the migration path (copy first, or build a new Frame). Signed-off-by: stwilt <swilt@plainsight.ai>
lucasmundim
approved these changes
Apr 20, 2026
Contributor
lucasmundim
left a comment
There was a problem hiding this comment.
All five observations addressed cleanly: bounds check on SHM slot overflow, synchrony invariant documented, breaking change for read-only frames in RELEASE.md. LGTM.
6 tasks
stwilt
added a commit
that referenced
this pull request
May 15, 2026
…R-461) ## 📋 What does this PR do? Adds a `wait_for_runner_exit(runner, timeout=5.0)` helper in `tests/test_filter.py` that polls `runner.step()` until it returns something other than `False` (or times out), and replaces five fragile single-step shutdown assertions across four `TestFilterOld` tests: - `test_topo_balance_step` - `test_topo_ephemeral_join_step` - `test_topo_doubly_ephemeral_tee_step` (two callsites) - `test_topo_subscribe_wildcard_all_step` Each callsite previously asserted that a single `runner.step()` immediately after `qout.put(None)` would return the exit-codes list (`[0] * N` for N filters). With the helper, the assertion now waits up to 5 s for the runner to actually report termination — preserving the original assertion semantics, but with the same `step → settle → assert` discipline the data-frame iterations already use. ## 🔍 Why is this needed? `test_topo_balance_step` started failing on Python 3.10 only as of [Create Release dry-run run #25936326919](https://github.com/PlainsightAI/openfilter/actions/runs/25936326919), while passing on 3.11 / 3.12 / 3.13. The same test passed on 3.10 for the v0.1.30 release ([run #24800470941](https://github.com/PlainsightAI/openfilter/actions/runs/24800470941), 2026-04-22). Between those, #82 introduced the zero-copy IPC transport with SHM — a perf rewrite of the layer this test exercises. The faster IPC shifted timing margins such that on 3.10's GIL/scheduling, at least one of N filters routinely hasn't yet drained the `None` shutdown sentinel + reported its exit code by the time the next `step()` runs. Result: `step()` returns `False` ("not done") instead of `[0, 0, …]`. This is **test fragility, not a runtime regression** — the 9 data-frame iterations in the same test exercise the full pipeline correctly across 3.10 too. But it was also a **release blocker**: the Create Release workflow gates on every matrix Python version passing run-tests, so any release dispatch was rolling dice on the 3.10 job. See [FILTER-461](https://plainsight-ai.atlassian.net/browse/FILTER-461) for the full root-cause analysis. ## 🧪 How was it tested? Set up a fresh Python 3.10 venv with the project in editable mode (`uv venv .venv-310 --python 3.10`, `uv pip install -e ".[all]"`, install pytest), and ran the failing test: ``` $ .venv-310/bin/pytest tests/test_filter.py::TestFilterOld::test_topo_balance_step -v … tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED [100%] 1 passed in 3.66s ``` Repeated 3× consecutively — all green. Also ran all four patched tests together — all pass: ``` tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED [ 25%] tests/test_filter.py::TestFilterOld::test_topo_ephemeral_join_step PASSED [ 50%] tests/test_filter.py::TestFilterOld::test_topo_doubly_ephemeral_tee_step PASSED [ 75%] tests/test_filter.py::TestFilterOld::test_topo_subscribe_wildcard_all_step PASSED [100%] 4 passed in 8.72s ``` Final CI confirmation will come from the Create Release re-dispatch after this merges — all four Python matrix versions should pass run-tests consistently. ## 🔗 Related Issues - [FILTER-461](https://plainsight-ai.atlassian.net/browse/FILTER-461) — root-cause analysis and fix proposal (this PR implements the proposed fix). - Unblocks the v0.2.1 release ceremony (validation of the `PLAINSIGHT_PYPI_TOKEN` publish path was caught by this flake before reaching the PyPI upload step). ## ✅ Checklist - [x] I have read and agreed to the terms of the [LICENSE](../LICENSE) - [x] I have read the [CONTRIBUTING](../CONTRIBUTING.md) guide - [x] I have followed the [coding style](../CONTRIBUTING.md#coding-style) - [x] I have signed all commits in compliance with the DCO (`git commit -s`) - [x] I have added or updated **tests** as needed (the fix IS the test change; helper is documented in-line) - [x] I have added or updated **documentation** as needed (helper has a docstring explaining the FILTER-461 context) Signed-off-by: stwilt <swilt@plainsight.ai>
stwilt
added a commit
that referenced
this pull request
May 15, 2026
…R-461) (#94) ## 📋 What does this PR do? Adds a `wait_for_runner_exit(runner, timeout=5.0)` helper in `tests/test_filter.py` that polls `runner.step()` until it returns something other than `False` (or times out), and replaces five fragile single-step shutdown assertions across four `TestFilterOld` tests: - `test_topo_balance_step` - `test_topo_ephemeral_join_step` - `test_topo_doubly_ephemeral_tee_step` (two callsites) - `test_topo_subscribe_wildcard_all_step` Each callsite previously asserted that a single `runner.step()` immediately after `qout.put(None)` would return the exit-codes list (`[0] * N` for N filters). With the helper, the assertion now waits up to 5 s for the runner to actually report termination — preserving the original assertion semantics, but with the same `step → settle → assert` discipline the data-frame iterations already use. ## 🔍 Why is this needed? `test_topo_balance_step` started failing on Python 3.10 only as of [Create Release dry-run run #25936326919](https://github.com/PlainsightAI/openfilter/actions/runs/25936326919), while passing on 3.11 / 3.12 / 3.13. The same test passed on 3.10 for the v0.1.30 release ([run #24800470941](https://github.com/PlainsightAI/openfilter/actions/runs/24800470941), 2026-04-22). Between those, #82 introduced the zero-copy IPC transport with SHM — a perf rewrite of the layer this test exercises. The faster IPC shifted timing margins such that on 3.10's GIL/scheduling, at least one of N filters routinely hasn't yet drained the `None` shutdown sentinel + reported its exit code by the time the next `step()` runs. Result: `step()` returns `False` ("not done") instead of `[0, 0, …]`. This is **test fragility, not a runtime regression** — the 9 data-frame iterations in the same test exercise the full pipeline correctly across 3.10 too. But it was also a **release blocker**: the Create Release workflow gates on every matrix Python version passing run-tests, so any release dispatch was rolling dice on the 3.10 job. See [FILTER-461](https://plainsight-ai.atlassian.net/browse/FILTER-461) for the full root-cause analysis. ## 🧪 How was it tested? Set up a fresh Python 3.10 venv with the project in editable mode (`uv venv .venv-310 --python 3.10`, `uv pip install -e ".[all]"`, install pytest), and ran the failing test: ``` $ .venv-310/bin/pytest tests/test_filter.py::TestFilterOld::test_topo_balance_step -v … tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED [100%] 1 passed in 3.66s ``` Repeated 3× consecutively — all green. Also ran all four patched tests together — all pass: ``` tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED [ 25%] tests/test_filter.py::TestFilterOld::test_topo_ephemeral_join_step PASSED [ 50%] tests/test_filter.py::TestFilterOld::test_topo_doubly_ephemeral_tee_step PASSED [ 75%] tests/test_filter.py::TestFilterOld::test_topo_subscribe_wildcard_all_step PASSED [100%] 4 passed in 8.72s ``` Final CI confirmation will come from the Create Release re-dispatch after this merges — all four Python matrix versions should pass run-tests consistently. ## 🔗 Related Issues - [FILTER-461](https://plainsight-ai.atlassian.net/browse/FILTER-461) — root-cause analysis and fix proposal (this PR implements the proposed fix). - Unblocks the v0.2.1 release ceremony (validation of the `PLAINSIGHT_PYPI_TOKEN` publish path was caught by this flake before reaching the PyPI upload step). ## ✅ Checklist - [x] I have read and agreed to the terms of the [LICENSE](../LICENSE) - [x] I have read the [CONTRIBUTING](../CONTRIBUTING.md) guide - [x] I have followed the [coding style](../CONTRIBUTING.md#coding-style) - [x] I have signed all commits in compliance with the DCO (`git commit -s`) - [x] I have added or updated **tests** as needed (the fix IS the test change; helper is documented in-line) - [x] I have added or updated **documentation** as needed (helper has a docstring explaining the FILTER-461 context) [FILTER-461]: https://plainsight-ai.atlassian.net/browse/FILTER-461?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Signed-off-by: stwilt <swilt@plainsight.ai>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
📦 Pull Request
📋 What does this PR do?
Eliminates redundant image copies on the MQ inter-filter transport hot path through four layered optimizations, then adds an opt-in POSIX shared-memory transport that bypasses the kernel socket buffer entirely. Also adds a terminal-based waterfall display for visualizing the IPC performance gains across git refs.
Four perf optimizations (cumulative on 4K native, 2 filters, raw):
Plus a terminal waterfall display (make waterfall) that runs the bench harness's pipeline_simulation scenarios at two git refs and renders a rich-styled before/after comparison with per-pipeline gains and per-stage delta annotations. Useful for PR descriptions and demos.
🔍 Why is this needed?
The IPC benchmark harness (PR #77 / FILTER-415) showed that 4K-native raw pipelines spent 63% of their per-frame budget on IPC overhead — the cost of moving 24MB image buffers between filters via ZMQ was larger than the filter processing itself. The four optimizations in this PR reduce that overhead by 83% (35ms → 5.7ms per frame at 4K), lifting max throughput from 18 fps to 39 fps on the same hardware without touching any filter code.
The waterfall display was built to show these gains — it runs the same bench scenarios against any two git refs (e.g., HEAD vs origin/main) and renders a terminal panel showing exactly where the time moved. It's anchored to the bench harness's TestPipelineSimulation so it automatically follows any future runtime composition changes.
🧪 How was it tested?
Measured results (4K native, 2 filters, raw):
🔗 Related Issues
🖼️ Screenshots or Logs (if applicable)
The make waterfall output (run with --frames 100 --baseline-ref origin/main) shows the per-pipeline gains and a frame waterfall for the 4K case. Best viewed in a terminal with truecolor support; Plainsight brand palette applied to structural chrome, conventional green/red for diagnostic signals.
✅ Checklist