BUG: Fix ParallelSparseFieldLevelSet deadlock with explicit std::threads#6321
BUG: Fix ParallelSparseFieldLevelSet deadlock with explicit std::threads#6321hjmjohnson wants to merge 2 commits into
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
ee2083a to
d395122
Compare
ParallelSparseFieldLevelSetImageFilter::Iterate() dispatches its
per-iteration ThreadedApplyUpdate body via mt->ParallelizeArray().
The body invokes SignalNeighborsAndWait() several times, blocking on
per-thread std::condition_variables until both neighbor work units
have arrived at the same phase. This neighbor-only protocol assumes
every work unit holds its own concurrent OS thread for the full
duration of the dispatch. Pool/TBB-backed ParallelizeArray does not
honor this: N work units are submitted to a shared worker pool whose
size is bounded by core count, and CV-blocked tasks pin the pool so
queued work units never start and the dispatch deadlocks.
Reported as the intermittent timeout of
itkParallelSparseFieldLevelSetImageFilterTest in CDash test 2570265561
("Trying mf->Update()" followed by a 1000s timeout) on Windows
Release with TBB.
The existing test fails at ~3% per run under TBB, which is too rare
to catch the bug reliably in CI. This commit adds a new test
itkParallelSparseFieldLevelSetImageFilterRobustnessTest that drives
the deadlock to a high per-test-invocation probability:
Scenario 1 (sweep x repeat): cycle 1/2/4/8/11/16/32 work-unit
counts for 20 outer iterations, totalling 140 short filter runs.
Per-run pool-starvation deadlock probability is low but
amplification gives a ~95% per-test-invocation hit rate.
Scenario 2 (determinism): three repeated runs at workUnits=11 must
produce bit-identical output. Guards against future barrier
rewrites that introduce numerical non-determinism.
Scenario 3 (concurrent multi-pipeline): eight pipelines run
simultaneously via std::async (88 work units competing for a
core-count-bounded worker pool). Cross-pipeline contention
directly probes the dispatch-starvation deadlock path.
Local measurements on a 16-core x86 box with TBB:
- Upstream code, 30 invocations: 30 timeouts (100% deadlock).
- Upstream code, POOL/PLATFORM: pass (no worker shortage possible).
- Test runtime with a working synchronization: ~5s wallclock.
- CTest TIMEOUT set to 30 (6x leeway over expected runtime).
This commit intentionally introduces a failing test on master. The
proposed fix lands in a follow-up PR.
Replace the one ParallelizeArray call in
ParallelSparseFieldLevelSetImageFilter::Iterate() that contains inner
SignalNeighborsAndWait calls with explicit std::thread spawning. The
inner sync requires every work unit to hold its own concurrent OS
thread for the full duration of the dispatch; ParallelizeArray's
pool/TBB backends cannot guarantee this when work-unit count exceeds
free worker count. std::thread guarantees one OS thread per work
unit so the neighbor-only sync protocol (left untouched) completes
on every invocation.
Other parallel sections in Iterate() (ThreadedCalculateChange,
ThreadedLoadBalance1/2, post-process) do not call
SignalNeighborsAndWait and stay on ParallelizeArray.
The robustness test added in the previous PR drops from a 100%
per-invocation deadlock rate to 0/20 with this change. The baseline
itkParallelSparseFieldLevelSetImageFilterTest image-compare against
the on-disk MD5-pinned baseline passes bit-identically.
Tracks the test PR ("ENH: Add robustness test exposing
ParallelSparseFieldLevelSet deadlock") and depends on it being merged
first; the test currently fails on master and turns green with this
commit applied.
d395122 to
7e2c489
Compare
|
The deadlock this PR is fixing traces directly to Dženan's 2018–2019 refactor: replacing the global memory barrier with neighbor-only SignalNeighborsAndWait condition variables dispatched via ParallelizeArray is exactly the design that can't be satisfied under TBB worker-shortage (N CV-blocked tasks pin M<N workers). The original Awate algorithm and the protocol itself are sound; the dispatcher assumption introduced during the barrier-removal modernization is what breaks. |
dzenanz
left a comment
There was a problem hiding this comment.
The fix looks good. I don't have enough time to review the test.
|
My initial instinct that is a "WaitForThreads" blocker in a ParallelizeArray call is a violation of independence of the element computation. I presume that such an algorithm could easily be converted into two passes over the array. @dzenanz I think the documentation for this method in MultiThreaderBase should be updated to clarify the requirements of the calling function. |
You could propose a PR with the clarifications you would like to be there 😄 |
|
I spent some time further looking at the algorithm. I am not convinced that other deadlocks do not exist with other multi-threaders with different number of threads and work units due to the blocking for neighbors. I believe the easiest change to make to preserve this original algorithm which assumes a fixed number of threads and work units is to hard code usage of the PlatforMultiThreader. Similar to what is done here: Ideally another algorithm would be implemented which does not use blocking and waiting. |
Iterate()'s ThreadedApplyUpdate dispatch blocks in SignalNeighborsAndWait until neighbor work units arrive, so every work unit needs its own concurrent OS thread. Pool/TBB ParallelizeArray queues N tasks onto a bounded worker set; CV-blocked tasks pin the workers so queued units never start, deadlocking. Force PlatformMultiThreader with DynamicMultiThreadingOff in the constructor (as SLICImageFilter does) so every work unit gets a dedicated OS thread. The neighbor-only sync protocol is unchanged. Supersedes #6321, which spawned explicit std::threads for only one ParallelizeArray call; forcing the threader preserves the filter's fixed thread==work-unit assumption across all parallel sections. Co-Authored-By: Bradley Lowekamp <blowekamp@mail.nih.gov>
…rm-multithreader BUG: Fix ParallelSparseFieldLevelSet deadlock with PlatformMultiThreader (supersedes #6321)
|
Superceeded by: #6337 |
Fix for the
ParallelSparseFieldLevelSetImageFilterdispatch-starvation deadlock demonstrated by #6320. Tracks #6320; depends on it merging first (or rebase againstmainafter #6320 lands so this PR's diff becomes just the.hxxchange).Iterate()dispatches itsThreadedApplyUpdatebody viamt->ParallelizeArray(). That body invokesSignalNeighborsAndWait()repeatedly, blocking on per-threadstd::condition_variables — the neighbor-only sync protocol therefore requires every work unit to hold its own concurrent OS thread for the full duration of the dispatch. Pool/TBB-backedParallelizeArraycannot guarantee this: N tasks are queued onto an M-worker pool where M may be < N, and CV-blocked tasks pin all workers so queued tasks never start. Deadlock.This PR replaces only the one
ParallelizeArraycall that contains innerSignalNeighborsAndWaitwith explicitstd::threadspawning: work unit 0 runs on the calling thread,N − 1std::threads are spawned for the remainder and joined at the end. The existing neighbor-only signal/wait protocol (SignalNeighborsAndWait/SignalNeighbor/WaitForNeighborand them_Semaphore/m_Lock/m_Conditiondata members) is left untouched — the bug was that the dispatcher couldn't satisfy the protocol's requirement, not the protocol itself.Scope
The other
ParallelizeArrayinvocations inIterate()(ThreadedCalculateChange,ThreadedLoadBalance1,ThreadedLoadBalance2, the post-process pass) stay onParallelizeArray. They do not callSignalNeighborsAndWaitinside the body, so no CV-blocking happens inside the dispatch and pool/TBB scheduling works for them.SignalNeighborsAndWaitand all the supporting machinery are unchanged.Verification
Local soak on a 16-core x86 box with TBB enabled:
itkParallelSparseFieldLevelSetImageFilterTest(image-compare), TBB ×100The baseline image-compare against
Baseline/BasicFilters/ParallelSparseFieldLevelSetImageFilterTest.mhapasses bit-identically.Why explicit std::thread rather than a global barrier
A first attempt replaced the neighbor-only signal/wait with a generation-counted global barrier (
std::mutex+std::condition_variable+ arrival counter). The global barrier itself was correct, but it required all N waiters to be simultaneously live — whichParallelizeArraystill cannot guarantee under TBB worker shortage. The global-barrier rewrite had a worse deadlock rate (27% per run) than the original.Explicit
std::threadspawning sidesteps the dispatcher's worker-count limit entirely: each work unit gets its own OS thread, lives for the full dispatch, and the unchanged neighbor sync protocol completes naturally.