Skip to content

fix(tests): poll for runner exit instead of single-step assert (FILTER-461)#94

Merged
stwilt merged 1 commit into
mainfrom
fix/filter-461-topo-balance-shutdown-poll
May 15, 2026
Merged

fix(tests): poll for runner exit instead of single-step assert (FILTER-461)#94
stwilt merged 1 commit into
mainfrom
fix/filter-461-topo-balance-shutdown-poll

Conversation

@stwilt
Copy link
Copy Markdown
Contributor

@stwilt stwilt commented May 15, 2026

📋 What does this PR do?

Adds a wait_for_runner_exit(runner, timeout=5.0) helper in tests/test_filter.py that polls runner.step() until it returns something other than False (or times out), and replaces five fragile single-step shutdown assertions across four TestFilterOld tests:

  • test_topo_balance_step
  • test_topo_ephemeral_join_step
  • test_topo_doubly_ephemeral_tee_step (two callsites)
  • test_topo_subscribe_wildcard_all_step

Each callsite previously asserted that a single runner.step() immediately after qout.put(None) would return the exit-codes list ([0] * N for N filters). With the helper, the assertion now waits up to 5 s for the runner to actually report termination — preserving the original assertion semantics, but with the same step → settle → assert discipline the data-frame iterations already use.

🔍 Why is this needed?

test_topo_balance_step started failing on Python 3.10 only as of Create Release dry-run run #25936326919, while passing on 3.11 / 3.12 / 3.13. The same test passed on 3.10 for the v0.1.30 release (run #24800470941, 2026-04-22). Between those, #82 introduced the zero-copy IPC transport with SHM — a perf rewrite of the layer this test exercises. The faster IPC shifted timing margins such that on 3.10's GIL/scheduling, at least one of N filters routinely hasn't yet drained the None shutdown sentinel + reported its exit code by the time the next step() runs. Result: step() returns False ("not done") instead of [0, 0, …].

This is test fragility, not a runtime regression — the 9 data-frame iterations in the same test exercise the full pipeline correctly across 3.10 too. But it was also a release blocker: the Create Release workflow gates on every matrix Python version passing run-tests, so any release dispatch was rolling dice on the 3.10 job.

See FILTER-461 for the full root-cause analysis.

🧪 How was it tested?

Set up a fresh Python 3.10 venv with the project in editable mode (uv venv .venv-310 --python 3.10, uv pip install -e ".[all]", install pytest), and ran the failing test:

$ .venv-310/bin/pytest tests/test_filter.py::TestFilterOld::test_topo_balance_step -v
…
tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED  [100%]
1 passed in 3.66s

Repeated 3× consecutively — all green. Also ran all four patched tests together — all pass:

tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED  [ 25%]
tests/test_filter.py::TestFilterOld::test_topo_ephemeral_join_step PASSED  [ 50%]
tests/test_filter.py::TestFilterOld::test_topo_doubly_ephemeral_tee_step PASSED  [ 75%]
tests/test_filter.py::TestFilterOld::test_topo_subscribe_wildcard_all_step PASSED  [100%]
4 passed in 8.72s

Final CI confirmation will come from the Create Release re-dispatch after this merges — all four Python matrix versions should pass run-tests consistently.

🔗 Related Issues

  • FILTER-461 — root-cause analysis and fix proposal (this PR implements the proposed fix).
  • Unblocks the v0.2.1 release ceremony (validation of the PLAINSIGHT_PYPI_TOKEN publish path was caught by this flake before reaching the PyPI upload step).

✅ Checklist

  • I have read and agreed to the terms of the LICENSE
  • I have read the CONTRIBUTING guide
  • I have followed the coding style
  • I have signed all commits in compliance with the DCO (git commit -s)
  • I have added or updated tests as needed (the fix IS the test change; helper is documented in-line)
  • I have added or updated documentation as needed (helper has a docstring explaining the FILTER-461 context)

@stwilt stwilt requested a review from lucasmundim May 15, 2026 21:17
Copy link
Copy Markdown
Contributor

@lucasmundim lucasmundim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — the polling helper is sound and the call-site replacements are correct. Two non-blocking nits below.

Comment thread tests/test_filter.py Outdated
Comment thread tests/test_filter.py
…R-461)

## 📋 What does this PR do?

Adds a `wait_for_runner_exit(runner, timeout=5.0)` helper in
`tests/test_filter.py` that polls `runner.step()` until it returns
something other than `False` (or times out), and replaces five fragile
single-step shutdown assertions across four `TestFilterOld` tests:

- `test_topo_balance_step`
- `test_topo_ephemeral_join_step`
- `test_topo_doubly_ephemeral_tee_step` (two callsites)
- `test_topo_subscribe_wildcard_all_step`

Each callsite previously asserted that a single `runner.step()` immediately
after `qout.put(None)` would return the exit-codes list (`[0] * N` for N
filters). With the helper, the assertion now waits up to 5 s for the
runner to actually report termination — preserving the original assertion
semantics, but with the same `step → settle → assert` discipline the
data-frame iterations already use.

## 🔍 Why is this needed?

`test_topo_balance_step` started failing on Python 3.10 only as of
[Create Release dry-run run #25936326919](https://github.com/PlainsightAI/openfilter/actions/runs/25936326919),
while passing on 3.11 / 3.12 / 3.13. The same test passed on 3.10 for the
v0.1.30 release ([run #24800470941](https://github.com/PlainsightAI/openfilter/actions/runs/24800470941),
2026-04-22). Between those, #82 introduced the zero-copy IPC transport
with SHM — a perf rewrite of the layer this test exercises. The faster
IPC shifted timing margins such that on 3.10's GIL/scheduling, at least
one of N filters routinely hasn't yet drained the `None` shutdown
sentinel + reported its exit code by the time the next `step()` runs.
Result: `step()` returns `False` ("not done") instead of `[0, 0, …]`.

This is **test fragility, not a runtime regression** — the 9 data-frame
iterations in the same test exercise the full pipeline correctly across
3.10 too. But it was also a **release blocker**: the Create Release
workflow gates on every matrix Python version passing run-tests, so any
release dispatch was rolling dice on the 3.10 job.

See [FILTER-461](https://plainsight-ai.atlassian.net/browse/FILTER-461)
for the full root-cause analysis.

## 🧪 How was it tested?

Set up a fresh Python 3.10 venv with the project in editable mode (`uv
venv .venv-310 --python 3.10`, `uv pip install -e ".[all]"`, install
pytest), and ran the failing test:

```
$ .venv-310/bin/pytest tests/test_filter.py::TestFilterOld::test_topo_balance_step -v
…
tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED  [100%]
1 passed in 3.66s
```

Repeated 3× consecutively — all green. Also ran all four patched tests
together — all pass:

```
tests/test_filter.py::TestFilterOld::test_topo_balance_step PASSED  [ 25%]
tests/test_filter.py::TestFilterOld::test_topo_ephemeral_join_step PASSED  [ 50%]
tests/test_filter.py::TestFilterOld::test_topo_doubly_ephemeral_tee_step PASSED  [ 75%]
tests/test_filter.py::TestFilterOld::test_topo_subscribe_wildcard_all_step PASSED  [100%]
4 passed in 8.72s
```

Final CI confirmation will come from the Create Release re-dispatch after
this merges — all four Python matrix versions should pass run-tests
consistently.

## 🔗 Related Issues

- [FILTER-461](https://plainsight-ai.atlassian.net/browse/FILTER-461) —
  root-cause analysis and fix proposal (this PR implements the proposed fix).
- Unblocks the v0.2.1 release ceremony (validation of the
  `PLAINSIGHT_PYPI_TOKEN` publish path was caught by this flake before
  reaching the PyPI upload step).

## ✅ Checklist

- [x] I have read and agreed to the terms of the [LICENSE](../LICENSE)
- [x] I have read the [CONTRIBUTING](../CONTRIBUTING.md) guide
- [x] I have followed the [coding style](../CONTRIBUTING.md#coding-style)
- [x] I have signed all commits in compliance with the DCO (`git commit -s`)
- [x] I have added or updated **tests** as needed (the fix IS the test
      change; helper is documented in-line)
- [x] I have added or updated **documentation** as needed (helper has
      a docstring explaining the FILTER-461 context)

Signed-off-by: stwilt <swilt@plainsight.ai>
@stwilt stwilt force-pushed the fix/filter-461-topo-balance-shutdown-poll branch from 75901fd to b018167 Compare May 15, 2026 23:19
@stwilt stwilt merged commit d936bc8 into main May 15, 2026
11 checks passed
@stwilt stwilt deleted the fix/filter-461-topo-balance-shutdown-poll branch May 15, 2026 23:27
stwilt added a commit that referenced this pull request May 16, 2026
Addresses two non-blocking observations from shingonoide on #96:

1. Adds a bullet under `### Infrastructure` for **#85**
   (`feat(cascade): tag-triggered bump-PR cascade`), the foundational
   DT-145 work that replaced `cloudbuild-cascade.yaml`. The previous
   draft credited the follow-on #93 (cascade upper-bound widening) but
   not #85, which read as an editorial inconsistency rather than a
   deliberate scoping choice.

2. Extends the preamble so RELEASE.md is self-contained on the rollup
   semantics: explicitly notes that DT-145 (#85 / #93) and FILTER-461
   (#94) are folded in here despite being merged *after* the original
   v0.2.0 changelog window. The PR body already documented this; now
   future readers don't have to reach for the PR.

No changes to the FILTER-44x / OTLP TLS / FILTER-461 entries; no
version bump; parser regex still matches `v0.2.1 - 2026-05-16`.

Signed-off-by: stwilt <swilt@plainsight.ai>
stwilt added a commit that referenced this pull request May 16, 2026
## 📋 What does this PR do?

Edits `RELEASE.md` so the top entry collapses the previously-separate
`v0.2.0` (2026-05-08, FILTER-441/442/444/452 declarative-config work)
and `v0.2.1` (2026-05-09, OTLP TLS inference) entries under a single `##
v0.2.1 - 2026-05-16` header. `VERSION` already says `v0.2.1` and is not
modified by this PR. Also folds in two changes that shipped to main
*after* the v0.2.0 entry's authored date but before this rollup:

- DT-145 cascade-widening (#93) under `### Infrastructure`
- FILTER-461 test-fragility fix (#94) under `### Fixed`

## 🔍 Why is this needed?

`v0.2.0` was changelogged-but-never-tagged. The Create Release
workflow's `changelog-parser-action` reads only the top entry off
`RELEASE.md`; dispatching against main as-is would have tagged `v0.2.1`
with release notes covering only the OTLP TLS change, leaving the
FILTER-44x declarative-config work shipped-but-invisible in the GitHub
release UI.

Choosing rollup-into-v0.2.1 over the alternative paths:
- **Cut v0.2.0 first then v0.2.1 sequentially**: requires a temp branch
resetting `VERSION` to `v0.2.0` and dispatching from there — three
cascades over a few days, churn across 30 consumer repos.
- **Just cut v0.2.1 as-is**: loses the FILTER-44x notes on the GitHub
release page; consumers reading the release page would not see the
declarative-config work.

Roll-up is fastest to prod and gives reviewers a complete view of what's
actually shipping under the v0.2.1 tag.

## 🧪 How was it tested?

- Parser regex compatibility verified locally — the
`changelog-parser-action` regex
`^v([0-9]+)\.([0-9]+)\.([0-9]+)(?:\.([0-9]+)-((?:dev|rc|int)))?(?:\s*-\s*(\d{4}-\d{2}-\d{2}))?$`
matches `v0.2.1 - 2026-05-16` and extracts `(0, 2, 1, None, None,
'2026-05-16')`.
- `VERSION` and the new RELEASE.md top header agree on `v0.2.1`,
satisfying the workflow's mandatory match check.
- No code change; one file modified (`RELEASE.md`), pure content edit.

## 🔗 Related Issues

After merge, dispatching `Create Release` from main will tag `v0.2.1`,
build + publish to PyPI under the rolled-up notes, and fire
`cascade-on-tag.yaml` against all 30 eligible filter-* repos (per the
earlier dry-run discover sweep). The cascade-widening shipped in #93
means consumers pinning `>=0.1.30,<0.2.0` will get a constraint-widening
PR opened in their repos rather than being silently skipped.

Final two ops items before dispatch:
- Confirm gitops PR #117 (filter telemetry endpoint `http://` prefix) is
merged so newly-released filter pods don't break on TLS-by-default
against the plaintext in-cluster collector.
- Decide on cascade auto-merge stance: default is on; can override
per-cascade via `workflow_dispatch` if a phased rollout is desired.

## ✅ Checklist

- [x] I have read and agreed to the terms of the [LICENSE](../LICENSE)
- [x] I have read the [CONTRIBUTING](../CONTRIBUTING.md) guide
- [x] I have followed the [coding
style](../CONTRIBUTING.md#coding-style)
- [x] I have signed all commits in compliance with the DCO (`git commit
-s`)
- [x] I have added or updated **tests** as needed (documentation-only
change; no test impact)
- [x] I have added or updated **documentation** as needed (this PR *is*
the documentation change)

---------

Signed-off-by: stwilt <swilt@plainsight.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants