test: de-flake watchdog + pheromone-scaling assertions; reconcile TEST_INVENTORY #16
Merged
… under ctest -j

Found by an empirical flake loop (full suite x10 under `ctest -j4` on WSL2/Linux, matching CI): two tests fail rarely under -j oversubscription.

orchestration_test §4 (watchdog timeout): a missed heartbeat must be detected by the monitor thread, which the test waited only 100 ms for. Under -j contention on a 2-vCPU box that thread can be descheduled for longer, a false-negative flake. Use a generous 2 s deadline (the loop still exits the instant the fault fires, ~12 ms normally). §3 (heartbeat keeps node alive): bump the node timeout 500 ms -> 2 s so a stretched inter-heartbeat sleep can't trip a false-positive fault. Neither masks anything: both still assert the watchdog's actual behaviour.

stigmergy_pheromone_test §10: asserted a hard scaling-SHAPE floor (r4 >= 2.0x r1) on non-CI machines. That is a PERFORMANCE claim, inherently contention-sensitive, and it flaked under `ctest -j` even on a dev box (it was only skipped when CI=true). The unit test now asserts FUNCTIONAL throughput only (each thread count produces work); the scaling shape is measured properly, with affinity pinning, in bench/, its correct home. Drops the CI-vs-dev branch entirely → deterministic everywhere.

Verified: full suite 0/18 flaky under `ctest -j4` after the fixes (orchestration was 1/10, stigmergy_pheromone 1/12 before); both pass on WSL gcc-13 + Windows MinGW.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
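The "generous deadline, early exit" pattern described for the watchdog test can be sketched as a small polling helper. This is only an illustration under assumptions: `wait_until_true`, its parameters, and the 1 ms poll interval are hypothetical names and values, not code from orchestration_test.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical helper (not from the test suite): polls `pred` until it
// returns true or `deadline` elapses. A long deadline costs nothing in the
// normal case because the loop returns as soon as the condition holds.
template <typename Pred>
bool wait_until_true(Pred pred,
                     std::chrono::milliseconds deadline,
                     std::chrono::milliseconds poll = std::chrono::milliseconds(1)) {
    const auto stop = std::chrono::steady_clock::now() + deadline;
    while (std::chrono::steady_clock::now() < stop) {
        if (pred()) {
            return true;  // exit the instant the condition is observed
        }
        std::this_thread::sleep_for(poll);
    }
    return pred();  // one final check once the deadline has passed
}
```

A test would call it with the flag the monitor thread sets, for example `wait_until_true([&] { return fault.load(); }, std::chrono::milliseconds(2000))`, and assert on the returned value. The deadline only bounds the failure case, so widening it from 100 ms to 2 s does not slow down a passing run.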
…proved) code

The §3.1 matrix still tagged bench_ring_channel / bench_circuit_breaker / bench_frame_arena / bench_hal_primitives as Tier C with "add warmup / replace 0xDEADBEEF DCE hack", but the V1 harness migration already did that (verified in source: all use bench::measure_repeated + bench::escape(); the hacks are gone). Bump them C->B to match the code and BENCHMARK_FAIRNESS.md.

Likewise the §3.2 / §4 comparison rows still described the OLD broken state (14x MPMC import, 178x submit-only, gRPC inproc-transport lie), contradicting BENCHMARK_FAIRNESS.md, which records these as resolved (D-1/D-2/D-3). Defer the ratios to that SoT and mark the remaining open item honestly: independent re-run of the comparison numbers on this machine (Boost/Taskflow/gRPC available on WSL; concurrencpp pending).

Also: bench::escape() promotion to BenchHarness.hpp is done (D2 resolved); refresh the §7 summary and the cross-cutting issues list.

No code change: documentation reconciliation only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
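For readers unfamiliar with why a benchmark needs an escape barrier against dead-code elimination (DCE), here is a minimal sketch of the common GCC/Clang idiom. It is an assumption-laden stand-in: `escape_sketch` and `sum_to` are hypothetical names, and the project's actual `bench::escape()` in BenchHarness.hpp may be implemented differently.

```cpp
#include <cstdint>

// Hypothetical stand-in, NOT the project's bench::escape().
// GCC/Clang only: an empty asm statement that takes `p` as an input and
// declares a memory clobber, so the compiler must assume the pointed-to
// object may be read and cannot discard the stores that produced it.
inline void escape_sketch(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

// Example workload: without the barrier, an optimizer is free to fold or
// drop this loop when the caller ignores the result.
inline std::uint64_t sum_to(std::uint64_t n) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 1; i <= n; ++i) {
        acc += i;
    }
    escape_sketch(&acc);  // keep `acc` observable to the optimizer
    return acc;
}
```

Compared with writing a sentinel value to a global, a barrier of this kind adds no real store to the measured path, which is one plausible reason a harness would prefer it.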
…on benches

Rebuilt and re-ran ring_vs_boost_lockfree, pool_vs_taskflow, pool_vs_concurrencpp, and pn1_vs_grpc on the WSL env (g++-13, Release, pinned to CCD0). All four third-party libs are present (concurrencpp included; the earlier "pending" note was wrong).

Each bench completes losslessly (confirming the D-1 livelock, D-2 pool task-loss, and D-3 loopback fixes) and wins in the direction BENCHMARK_FAIRNESS.md records: SPSC both lossless, MPMC ~4×, pool submit→completion ~2.1×/~3.8× (both 200000/200000), concurrencpp large (genuine coroutine overhead), pn1_vs_grpc 3.6× p50 with the not-like-for-like caveat printed first.

Exact magnitudes vary with the machine/V-Cache pinning, so the SoT medians stay authoritative for the ratios; this run confirms direction + lossless completion. Raw capture added under docs/perf-history/.

Updates footnote 8, the §4 map rows, and the §7 summary accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
COMPARISONS.md was reconciled to BENCHMARK_FAIRNESS.md on 2026-05-21 (honest summary deferred to the SoT; stale section tables carry ⛔ Superseded banners with the old 5.5×/14×/18×/178×/1.41× numbers retired in-place). Update the inventory's "documentation inconsistencies" list to reflect that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Found via an empirical flake loop (full suite ×10 under `ctest -j4` on WSL2/Linux, matching the CI runner), then fixed deterministically:

- orchestration_test (watchdog): §4 false-negative (100 ms wait too tight under -j → monitor thread starved); §3 false-positive (500 ms node timeout vs stretched inter-heartbeat sleeps). Both now use generous deadlines; both still assert the watchdog's real behaviour.
- stigmergy_pheromone_test §10: the hard scaling-shape floor (`r4 >= 2.0× r1`) is a perf claim, not correctness; it flaked under `ctest -j` even off-CI. Now asserts functional throughput only; scaling shape stays in `bench/`.

Verified 0/18 flaky under `ctest -j4` after the fixes (was 1/10 and 1/12 before); pass on WSL gcc-13 + Windows MinGW.
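A functional-throughput assertion of the kind described for stigmergy_pheromone_test could look roughly like the sketch below. It assumes a trivial counter workload; `run_workers` and its parameters are hypothetical and do not reflect the real pheromone code. The point is that the check depends only on the work being completed, never on how fast one thread count runs relative to another.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical workload (not the project's pheromone update): each of
// `threads` workers performs `iters_per_thread` increments on a shared
// counter. Returns the total number of completed operations.
inline std::uint64_t run_workers(std::size_t threads,
                                 std::uint64_t iters_per_thread) {
    std::atomic<std::uint64_t> done{0};
    std::vector<std::thread> pool;
    pool.reserve(threads);
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&done, iters_per_thread] {
            for (std::uint64_t i = 0; i < iters_per_thread; ++i) {
                done.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) {
        th.join();  // join before reading, so the count is final
    }
    return done.load();
}
```

A unit test would then assert, for each thread count such as 1, 2, and 4, that `run_workers(n, iters)` equals `n * iters`. That holds regardless of scheduler contention, whereas a ratio such as `r4 >= 2.0× r1` depends on timing and belongs in a pinned benchmark.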
Also reconciles TEST_INVENTORY.md with the actual code: 4 microbenches were already migrated to the v2 harness (warmup + `escape()`; the `0xDEADBEEF` hacks are gone) but still tagged Tier C, so they are bumped C→B; the comparison verdicts now defer to `BENCHMARK_FAIRNESS.md` (the SoT) instead of describing the old broken state.

Test plan

- `ctest -j4` …
- … `include/` change so doc-sync is unaffected

🤖 Generated with Claude Code