Skip to content

perf(mq): zero-copy IPC transport with SHM + perf waterfall display#82

Merged
stwilt merged 4 commits into
mainfrom
feat/filter-419-ipc-perf
Apr 20, 2026
Merged

perf(mq): zero-copy IPC transport with SHM + perf waterfall display#82
stwilt merged 4 commits into
mainfrom
feat/filter-419-ipc-perf

Conversation

@stwilt
Copy link
Copy Markdown
Contributor

@stwilt stwilt commented Apr 19, 2026

📦 Pull Request

📋 What does this PR do?

Eliminates redundant image copies on the MQ inter-filter transport hot path through four layered optimizations, then adds an opt-in POSIX shared-memory transport that bypasses the kernel socket buffer entirely. Also adds a terminal-based waterfall display for visualizing the IPC performance gains across git refs.

Four perf optimizations (cumulative on 4K native, 2 filters, raw):

  1. Replace bytearray(memoryview(frame.image)) with memoryview.cast('B') in frames2topicmsgs — drops a full pre-send image copy (~17ms → ~10ms/hop)
  2. recv_multipart(copy=False) on the ZMQ SUB socket — receiver gets zmq.Frame objects and np.frombuffer builds zero-copy views (~10ms → ~7.5ms/hop)
  3. send_multipart(copy=False) on the ZMQ PUB socket — sender hands the buffer to ZMQ without an internal copy (~7.5ms → ~6.5ms/hop)
  4. Opt-in SHM transport (OPENFILTER_SHM_ENABLE=1) — sender writes image into a POSIX shm ring slot and sends only a tiny reference over ZMQ; receiver attaches once and builds a zero-copy numpy view (~6.5ms → ~2.9ms/hop)

Plus a terminal waterfall display (make waterfall) that runs the bench harness's pipeline_simulation scenarios at two git refs and renders a rich-styled before/after comparison with per-pipeline gains and per-stage delta annotations. Useful for PR descriptions and demos.

🔍 Why is this needed?

The IPC benchmark harness (PR #77 / FILTER-415) showed that 4K-native raw pipelines spent 63% of their per-frame budget on IPC overhead — the cost of moving 24MB image buffers between filters via ZMQ was larger than the filter processing itself. The four optimizations in this PR reduce that overhead by 83% (35ms → 5.7ms per frame at 4K), lifting max throughput from 18 fps to 39 fps on the same hardware without touching any filter code.

The waterfall display was built to show these gains — it runs the same bench scenarios against any two git refs (e.g., HEAD vs origin/main) and renders a terminal panel showing exactly where the time moved. It's anchored to the bench harness's TestPipelineSimulation so it automatically follows any future runtime composition changes.

🧪 How was it tested?

  • make test passes (--benchmark-disable, existing test suite)
  • make bench passes (pipeline_simulation shows reduced IPC overhead)
  • OPENFILTER_SHM_ENABLE=1 make test passes (SHM path exercised)
  • make waterfall renders correctly against origin/main
  • make waterfall.profile produces /tmp/perf-waterfall.svg flamegraph
  • py-spy profiling confirmed bottleneck attribution at each stage
  • 200-frame bench-mirror reproducer verified gains with realistic access patterns (fresh .ro per iteration + filter sleep)

Measured results (4K native, 2 filters, raw):

Optimization            IPC OH/frame   Per hop    Max FPS
Baseline (main)         ~35 ms         ~17.5 ms   18
+ memoryview send       ~20 ms         ~10 ms     25
+ recv copy=False       ~15 ms         ~7.5 ms    29
+ send copy=False       ~13 ms         ~6.5 ms    30
+ SHM transport         ~5.7 ms        ~2.9 ms    39

🔗 Related Issues

  • FILTER-419: Investigate shared-memory frame transport (this PR)
  • FILTER-415: IPC benchmark harness (PR feat(tests): add IPC/serialization overhead benchmarking harness #77, merged — provides the measurement infrastructure this PR optimizes against)
  • FILTER-367: Inference Performance & Scaling (parent epic)
  • PLAT-866: OTel hop/sub-hop span instrumentation (follow-up — the same runtime paths instrumented here will get OTel spans)

🖼️ Screenshots or Logs (if applicable)

The make waterfall output (run with --frames 100 --baseline-ref origin/main) shows the per-pipeline gains and a frame waterfall for the 4K case. Best viewed in a terminal with truecolor support; Plainsight brand palette applied to structural chrome, conventional green/red for diagnostic signals.

✅ Checklist

  • I have read and agreed to the terms of the LICENSE
  • I have read the CONTRIBUTING guide
  • I have followed the coding style
  • I have signed all commits in compliance with the DCO
  • I have added or updated tests as needed
  • I have added or updated documentation as needed

…FILTER-419)

# 📦 Pull Request

## 📋 What does this PR do?

Eliminates redundant image copies on the MQ inter-filter transport
hot path through four layered optimizations, then adds an opt-in
POSIX shared-memory transport that bypasses the kernel socket buffer
entirely. Also adds a terminal-based waterfall display for visualizing
the IPC performance gains across git refs.

Four perf optimizations (cumulative on 4K native, 2 filters, raw):

1. Replace bytearray(memoryview(frame.image)) with memoryview.cast('B')
   in frames2topicmsgs — drops a full pre-send image copy (~17ms → ~10ms/hop)
2. recv_multipart(copy=False) on the ZMQ SUB socket — receiver gets
   zmq.Frame objects and np.frombuffer builds zero-copy views (~10ms → ~7.5ms/hop)
3. send_multipart(copy=False) on the ZMQ PUB socket — sender hands the
   buffer to ZMQ without an internal copy (~7.5ms → ~6.5ms/hop)
4. Opt-in SHM transport (OPENFILTER_SHM_ENABLE=1) — sender writes image
   into a POSIX shm ring slot and sends only a tiny reference over ZMQ;
   receiver attaches once and builds a zero-copy numpy view (~6.5ms → ~2.9ms/hop)

Plus a terminal waterfall display (make waterfall) that runs the bench
harness's pipeline_simulation scenarios at two git refs and renders a
rich-styled before/after comparison with per-pipeline gains and
per-stage delta annotations. Useful for PR descriptions and demos.

## 🔍 Why is this needed?

The IPC benchmark harness (PR #77 / FILTER-415) showed that 4K-native
raw pipelines spent 63% of their per-frame budget on IPC overhead —
the cost of moving 24MB image buffers between filters via ZMQ was
larger than the filter processing itself. The four optimizations in
this PR reduce that overhead by 83% (35ms → 5.7ms per frame at 4K),
lifting max throughput from 18 fps to 39 fps on the same hardware
without touching any filter code.

The waterfall display was built to show these gains — it runs the same
bench scenarios against any two git refs (e.g., HEAD vs origin/main)
and renders a terminal panel showing exactly where the time moved.
It's anchored to the bench harness's TestPipelineSimulation so it
automatically follows any future runtime composition changes.

## 🧪 How was it tested?

- make test passes (--benchmark-disable, existing test suite)
- make bench passes (pipeline_simulation shows reduced IPC overhead)
- OPENFILTER_SHM_ENABLE=1 make test passes (SHM path exercised)
- make waterfall renders correctly against origin/main
- make waterfall.profile produces /tmp/perf-waterfall.svg flamegraph
- py-spy profiling confirmed bottleneck attribution at each stage
- 200-frame bench-mirror reproducer verified gains with realistic
  access patterns (fresh .ro per iteration + filter sleep)

Measured results (4K native, 2 filters, raw):

    Optimization            IPC OH/frame   Per hop    Max FPS
    Baseline (main)         ~35 ms         ~17.5 ms   18
    + memoryview send       ~20 ms         ~10 ms     25
    + recv copy=False       ~15 ms         ~7.5 ms    29
    + send copy=False       ~13 ms         ~6.5 ms    30
    + SHM transport         ~5.7 ms        ~2.9 ms    39

## 🔗 Related Issues

- FILTER-419: Investigate shared-memory frame transport (this PR)
- FILTER-415: IPC benchmark harness (PR #77, merged — provides the
  measurement infrastructure this PR optimizes against)
- FILTER-367: Inference Performance & Scaling (parent epic)
- PLAT-866: OTel hop/sub-hop span instrumentation (follow-up — the
  same runtime paths instrumented here will get OTel spans)

## 🖼️ Screenshots or Logs (if applicable)

The make waterfall output (run with --frames 100 --baseline-ref origin/main)
shows the per-pipeline gains and a frame waterfall for the 4K case.
Best viewed in a terminal with truecolor support; Plainsight brand
palette applied to structural chrome, conventional green/red for
diagnostic signals.

## ✅ Checklist

- [x] I have read and agreed to the terms of the LICENSE
- [x] I have read the CONTRIBUTING guide
- [x] I have followed the coding style
- [x] I have signed all commits in compliance with the DCO
- [x] I have added or updated tests as needed
- [x] I have added or updated documentation as needed

Signed-off-by: stwilt <swilt@plainsight.ai>
…t bugs

Three bundled fixes for CI failures on this branch:

1. mq.py: After np.frombuffer in topicmsgs2frames (both the zero-copy
   ZMQ path and the SHM path), set image.flags.writeable = False. The
   previous bytearray-based path produced read-only arrays implicitly;
   zero-copy frombuffer views are writeable by default. Three consumers
   depend on received frames being immutable:
     - SHM slots are recycled round-robin by the sender, so any
       downstream mutation would corrupt live buffers.
     - Frame.copy() skips the deep copy only when image is read-only.
     - Frame.jpg caches the encoded blob only when image is read-only.

2. test_zeromq.py: ZMQReceiver.recv_multipart(copy=False) makes payload
   parts arrive as zmq.Frame wrappers rather than bytes. Teach the
   recv/recvl test helpers (and the OOB callback lambdas) to materialize
   zmq.Frame -> bytes for comparison. These tests exercise ZMQ routing,
   not buffer type, so coercion is the right level to fix at.

3. test_filter.py: test_filter_context_read_models_toml{,_invalid_format}
   wrote a TOML to /tmp and then tried to rename() it into cwd as
   models.toml. On hosts where /tmp is a separate filesystem this fails
   with EXDEV. Drop tempfile; write directly via Path.write_text inside
   a small context-manager helper that also handles pre-existing backup.

Signed-off-by: stwilt <swilt@plainsight.ai>
Copy link
Copy Markdown
Contributor

@lucasmundim lucasmundim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong performance PR with impressive measured gains (83% IPC overhead reduction at 4K). The layered approach is clean — each optimization is independently valuable and they compose well. CI green across Python 3.10-3.13.

What looks good:

  • Four independent optimizations that compose cleanly (memoryview send, recv copy=False, send copy=False, SHM)
  • SHM is fully opt-in (OPENFILTER_SHM_ENABLE=1) — zero behavioral change for existing deployments
  • image.flags.writeable = False on received frames — prevents downstream mutation of SHM/zero-copy buffers
  • Stale SHM segment cleanup on startup (sweeps prior crashed runs)
  • SHMAttachCache lazily attaches — no upfront allocation on the receiver side
  • Pipeline scenarios lifted to module scope for tooling reuse — DRY between bench and measurement scripts
  • Per-hop IPC timing added to the bench harness (hop_times_ms) — good observability
  • Waterfall display uses PEP 723 inline deps — no pyproject changes needed
  • Test fixture cleanup in test_filter.py (_with_models_toml context manager) is a nice improvement
  • test_zeromq.py properly handles zmq.Frame objects from copy=False

Five observations inline.

Comment thread openfilter/filter_runtime/shm_transport.py
Comment thread openfilter/filter_runtime/shm_transport.py
Comment thread openfilter/filter_runtime/mq.py
Comment thread openfilter/filter_runtime/mq.py
Comment thread openfilter/filter_runtime/mq.py
stwilt added 2 commits April 19, 2026 20:14
…iant

Addresses two concerns raised in review on PR #82:

- SHMPool.put() now raises ValueError with a clear message when the
  frame payload exceeds slot_size, rather than silently corrupting or
  producing an opaque error. The message names the overflow size and
  points to OPENFILTER_SHM_SLOT_SIZE.

- Added an invariant comment on SHMPool explaining that the
  round-robin slot reuse is only safe because the current ZMQ
  send->recv pipeline loop is synchronous (sender blocks until
  receiver consumes). Async/pipelined dispatch would require a
  generation counter or semaphore.

Signed-off-by: stwilt <swilt@plainsight.ai>
Follow-up to d310e28 (read-only invariant on received frames). Adds a
Breaking Changes note to the current release section explaining that
filters mutating frame.image in place will now raise
"ValueError: assignment destination is read-only", with the migration
path (copy first, or build a new Frame).

Signed-off-by: stwilt <swilt@plainsight.ai>
@stwilt stwilt requested a review from lucasmundim April 20, 2026 01:23
Copy link
Copy Markdown
Contributor

@lucasmundim lucasmundim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All five observations addressed cleanly: bounds check on SHM slot overflow, synchrony invariant documented, breaking change for read-only frames in RELEASE.md. LGTM.

@stwilt stwilt merged commit 72586c2 into main Apr 20, 2026
10 checks passed
@stwilt stwilt deleted the feat/filter-419-ipc-perf branch April 20, 2026 05:17
stwilt added a commit that referenced this pull request May 15, 2026
…R-461)

## 📋 What does this PR do?

Adds a `wait_for_runner_exit(runner, timeout=5.0)` helper in
`tests/test_filter.py` that polls `runner.step()` until it returns
something other than `False` (or times out), and replaces five fragile
single-step shutdown assertions across four `TestFilterOld` tests:

- `test_topo_balance_step`
- `test_topo_ephemeral_join_step`
- `test_topo_doubly_ephemeral_tee_step` (two callsites)
- `test_topo_subscribe_wildcard_all_step`

Each callsite previously asserted that a single `runner.step()` immediately
after `qout.put(None)` would return the exit-codes list (`[0] * N` for N
filters). With the helper, the assertion now waits up to 5 s for the
runner to actually report termination — preserving the original assertion
semantics, but with the same `step → settle → assert` discipline the
data-frame iterations already use.

## 🔍 Why is this needed?

`test_topo_balance_step` started failing on Python 3.10 only as of
[Create Release dry-run run #25936326919](https://github.com/PlainsightAI/openfilter/actions/runs/25936326919),
while passing on 3.11 / 3.12 / 3.13. The same test passed on 3.10 for the
v0.1.30 release ([run #24800470941](https://github.com/PlainsightAI/openfilter/actions/runs/24800470941),
2026-04-22). Between those, #82 introduced the zero-copy IPC transport
with SHM — a perf rewrite of the layer this test exercises. The faster
IPC shifted timing margins such that on 3.10's GIL/scheduling, at least
one of N filters routinely hasn't yet drained the `None` shutdown
sentinel + reported its exit code by the time the next `step()` runs.
Result: `step()` returns `False` ("not done") instead of `[0, 0, …]`.

This is **test fragility, not a runtime regression** — the 9 data-frame
iterations in the same test exercise the full pipeline correctly across
3.10 too. But it was also a **release blocker**: the Create Release
workflow gates on every matrix Python version passing run-tests, so any
release dispatch was rolling dice on the 3.10 job.

See [FILTER-461](https://plainsight-ai.atlassian.net/browse/FILTER-461)
for the full root-cause analysis.

## 🧪 How was it tested?

Set up a fresh Python 3.10 venv with the project in editable mode (`uv
venv .venv-310 --python 3.10`, `uv pip install -e ".[all]"`, install
pytest), and ran the failing test:

```
$ .venv-310/bin/pytest tests/test_filter.py::TestFilterOld::test_topo_balance_step -v
…
tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED  [100%]
1 passed in 3.66s
```

Repeated 3× consecutively — all green. Also ran all four patched tests
together — all pass:

```
tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED  [ 25%]
tests/test_filter.py::TestFilterOld::test_topo_ephemeral_join_step PASSED  [ 50%]
tests/test_filter.py::TestFilterOld::test_topo_doubly_ephemeral_tee_step PASSED  [ 75%]
tests/test_filter.py::TestFilterOld::test_topo_subscribe_wildcard_all_step PASSED  [100%]
4 passed in 8.72s
```

Final CI confirmation will come from the Create Release re-dispatch after
this merges — all four Python matrix versions should pass run-tests
consistently.

## 🔗 Related Issues

- [FILTER-461](https://plainsight-ai.atlassian.net/browse/FILTER-461) —
  root-cause analysis and fix proposal (this PR implements the proposed fix).
- Unblocks the v0.2.1 release ceremony (validation of the
  `PLAINSIGHT_PYPI_TOKEN` publish path was caught by this flake before
  reaching the PyPI upload step).

## ✅ Checklist

- [x] I have read and agreed to the terms of the [LICENSE](../LICENSE)
- [x] I have read the [CONTRIBUTING](../CONTRIBUTING.md) guide
- [x] I have followed the [coding style](../CONTRIBUTING.md#coding-style)
- [x] I have signed all commits in compliance with the DCO (`git commit -s`)
- [x] I have added or updated **tests** as needed (the fix IS the test
      change; helper is documented in-line)
- [x] I have added or updated **documentation** as needed (helper has
      a docstring explaining the FILTER-461 context)

Signed-off-by: stwilt <swilt@plainsight.ai>
stwilt added a commit that referenced this pull request May 15, 2026
…R-461) (#94)

## 📋 What does this PR do?

Adds a `wait_for_runner_exit(runner, timeout=5.0)` helper in
`tests/test_filter.py` that polls `runner.step()` until it returns
something other than `False` (or times out), and replaces five fragile
single-step shutdown assertions across four `TestFilterOld` tests:

- `test_topo_balance_step`
- `test_topo_ephemeral_join_step`
- `test_topo_doubly_ephemeral_tee_step` (two callsites)
- `test_topo_subscribe_wildcard_all_step`

Each callsite previously asserted that a single `runner.step()`
immediately after `qout.put(None)` would return the exit-codes list
(`[0] * N` for N filters). With the helper, the assertion now waits up
to 5 s for the runner to actually report termination — preserving the
original assertion semantics, but with the same `step → settle → assert`
discipline the data-frame iterations already use.

## 🔍 Why is this needed?

`test_topo_balance_step` started failing on Python 3.10 only as of
[Create Release dry-run run
#25936326919](https://github.com/PlainsightAI/openfilter/actions/runs/25936326919),
while passing on 3.11 / 3.12 / 3.13. The same test passed on 3.10 for
the v0.1.30 release ([run
#24800470941](https://github.com/PlainsightAI/openfilter/actions/runs/24800470941),
2026-04-22). Between those, #82 introduced the zero-copy IPC transport
with SHM — a perf rewrite of the layer this test exercises. The faster
IPC shifted timing margins such that on 3.10's GIL/scheduling, at least
one of N filters routinely hasn't yet drained the `None` shutdown
sentinel + reported its exit code by the time the next `step()` runs.
Result: `step()` returns `False` ("not done") instead of `[0, 0, …]`.

This is **test fragility, not a runtime regression** — the 9 data-frame
iterations in the same test exercise the full pipeline correctly across
3.10 too. But it was also a **release blocker**: the Create Release
workflow gates on every matrix Python version passing run-tests, so any
release dispatch was rolling dice on the 3.10 job.

See [FILTER-461](https://plainsight-ai.atlassian.net/browse/FILTER-461)
for the full root-cause analysis.

## 🧪 How was it tested?

Set up a fresh Python 3.10 venv with the project in editable mode (`uv
venv .venv-310 --python 3.10`, `uv pip install -e ".[all]"`, install
pytest), and ran the failing test:

```
$ .venv-310/bin/pytest tests/test_filter.py::TestFilterOld::test_topo_balance_step -v
…
tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED  [100%]
1 passed in 3.66s
```

Repeated 3× consecutively — all green. Also ran all four patched tests
together — all pass:

```
tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED  [ 25%]
tests/test_filter.py::TestFilterOld::test_topo_ephemeral_join_step PASSED  [ 50%]
tests/test_filter.py::TestFilterOld::test_topo_doubly_ephemeral_tee_step PASSED  [ 75%]
tests/test_filter.py::TestFilterOld::test_topo_subscribe_wildcard_all_step PASSED  [100%]
4 passed in 8.72s
```

Final CI confirmation will come from the Create Release re-dispatch
after this merges — all four Python matrix versions should pass
run-tests consistently.

## 🔗 Related Issues

- [FILTER-461](https://plainsight-ai.atlassian.net/browse/FILTER-461) —
root-cause analysis and fix proposal (this PR implements the proposed
fix).
- Unblocks the v0.2.1 release ceremony (validation of the
`PLAINSIGHT_PYPI_TOKEN` publish path was caught by this flake before
reaching the PyPI upload step).

## ✅ Checklist

- [x] I have read and agreed to the terms of the [LICENSE](../LICENSE)
- [x] I have read the [CONTRIBUTING](../CONTRIBUTING.md) guide
- [x] I have followed the [coding
style](../CONTRIBUTING.md#coding-style)
- [x] I have signed all commits in compliance with the DCO (`git commit
-s`)
- [x] I have added or updated **tests** as needed (the fix IS the test
change; helper is documented in-line)
- [x] I have added or updated **documentation** as needed (helper has a
docstring explaining the FILTER-461 context)


[FILTER-461]:
https://plainsight-ai.atlassian.net/browse/FILTER-461?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Signed-off-by: stwilt <swilt@plainsight.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants