Skip to content

BUG: Fix ParallelSparseFieldLevelSet deadlock with explicit std::threads#6321

Closed
hjmjohnson wants to merge 2 commits into
InsightSoftwareConsortium:mainfrom
hjmjohnson:fix/pslsif-deadlock-explicit-threads
Closed

BUG: Fix ParallelSparseFieldLevelSet deadlock with explicit std::threads#6321
hjmjohnson wants to merge 2 commits into
InsightSoftwareConsortium:mainfrom
hjmjohnson:fix/pslsif-deadlock-explicit-threads

Conversation

@hjmjohnson
Copy link
Copy Markdown
Member

Fix for the ParallelSparseFieldLevelSetImageFilter dispatch-starvation deadlock demonstrated by #6320. Tracks #6320; depends on it merging first (or rebase against main after #6320 lands so this PR's diff becomes just the .hxx change).

Iterate() dispatches its ThreadedApplyUpdate body via mt->ParallelizeArray(). That body invokes SignalNeighborsAndWait() repeatedly, blocking on per-thread std::condition_variables — the neighbor-only sync protocol therefore requires every work unit to hold its own concurrent OS thread for the full duration of the dispatch. Pool/TBB-backed ParallelizeArray cannot guarantee this: N tasks are queued onto an M-worker pool where M may be < N, and CV-blocked tasks pin all workers so queued tasks never start. Deadlock.

This PR replaces only the one ParallelizeArray call that contains inner SignalNeighborsAndWait with explicit std::thread spawning: work unit 0 runs on the calling thread, N − 1 std::threads are spawned for the remainder and joined at the end. The existing neighbor-only signal/wait protocol (SignalNeighborsAndWait/SignalNeighbor/WaitForNeighbor and the m_Semaphore/m_Lock/m_Condition data members) is left untouched — the bug was that the dispatcher couldn't satisfy the protocol's requirement, not the protocol itself.

Scope

The other ParallelizeArray invocations in Iterate() (ThreadedCalculateChange, ThreadedLoadBalance1, ThreadedLoadBalance2, the post-process pass) stay on ParallelizeArray. They do not call SignalNeighborsAndWait inside the body, so no CV-blocking happens inside the dispatch and pool/TBB scheduling works for them.

SignalNeighborsAndWait and all the supporting machinery are unchanged.

Verification

Local soak on a 16-core x86 box with TBB enabled:

Test master (no fix) with this PR
Baseline itkParallelSparseFieldLevelSetImageFilterTest (image-compare), TBB ×100 94/100 100/100
Robustness test from #6320, TBB ×30 0/30 (100% deadlock) 20/20
Robustness test, POOL ×20 20/20 (no worker shortage path) 20/20
Robustness test, PLATFORM ×20 20/20 (no worker shortage path) 20/20

The baseline image-compare against Baseline/BasicFilters/ParallelSparseFieldLevelSetImageFilterTest.mha passes bit-identically.

Why explicit std::thread rather than a global barrier

A first attempt replaced the neighbor-only signal/wait with a generation-counted global barrier (std::mutex + std::condition_variable + arrival counter). The global barrier itself was correct, but it required all N waiters to be simultaneously live — which ParallelizeArray still cannot guarantee under TBB worker shortage. The global-barrier rewrite had a worse deadlock rate (27% per run) than the original.

Explicit std::thread spawning sidesteps the dispatcher's worker-count limit entirely: each work unit gets its own OS thread, lives for the full dispatch, and the unchanged neighbor sync protocol completes naturally.

@github-actions github-actions Bot added type:Bug Inconsistencies or issues which will cause an incorrect result under some or all circumstances type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots type:Testing Ensure that the purpose of a class is met/the results on a wide set of test cases are correct area:Segmentation Issues affecting the Segmentation module labels May 21, 2026
@hjmjohnson hjmjohnson marked this pull request as ready for review May 21, 2026 22:35
@greptile-apps

This comment was marked as resolved.

@hjmjohnson hjmjohnson force-pushed the fix/pslsif-deadlock-explicit-threads branch 3 times, most recently from ee2083a to d395122 Compare May 23, 2026 18:57
ParallelSparseFieldLevelSetImageFilter::Iterate() dispatches its
per-iteration ThreadedApplyUpdate body via mt->ParallelizeArray().
The body invokes SignalNeighborsAndWait() several times, blocking on
per-thread std::condition_variables until both neighbor work units
have arrived at the same phase.  This neighbor-only protocol assumes
every work unit holds its own concurrent OS thread for the full
duration of the dispatch.  Pool/TBB-backed ParallelizeArray does not
honor this: N work units are submitted to a shared worker pool whose
size is bounded by core count, and CV-blocked tasks pin the pool so
queued work units never start and the dispatch deadlocks.

Reported as the intermittent timeout of
itkParallelSparseFieldLevelSetImageFilterTest in CDash test 2570265561
("Trying mf->Update()" followed by a 1000s timeout) on Windows
Release with TBB.

The existing test fails at ~3% per run under TBB, which is too rare
to catch the bug reliably in CI.  This commit adds a new test
itkParallelSparseFieldLevelSetImageFilterRobustnessTest that drives
the deadlock to a high per-test-invocation probability:

  Scenario 1 (sweep x repeat): cycle 1/2/4/8/11/16/32 work-unit
    counts for 20 outer iterations, totalling 140 short filter runs.
    Per-run pool-starvation deadlock probability is low but
    amplification gives a ~95% per-test-invocation hit rate.

  Scenario 2 (determinism): three repeated runs at workUnits=11 must
    produce bit-identical output.  Guards against future barrier
    rewrites that introduce numerical non-determinism.

  Scenario 3 (concurrent multi-pipeline): eight pipelines run
    simultaneously via std::async (88 work units competing for a
    core-count-bounded worker pool).  Cross-pipeline contention
    directly probes the dispatch-starvation deadlock path.

Local measurements on a 16-core x86 box with TBB:
  - Upstream code, 30 invocations: 30 timeouts (100% deadlock).
  - Upstream code, POOL/PLATFORM: pass (no worker shortage possible).
  - Test runtime with a working synchronization: ~5s wallclock.
  - CTest TIMEOUT set to 30 (6x leeway over expected runtime).

This commit intentionally introduces a failing test on master.  The
proposed fix lands in a follow-up PR.
Replace the one ParallelizeArray call in
ParallelSparseFieldLevelSetImageFilter::Iterate() that contains inner
SignalNeighborsAndWait calls with explicit std::thread spawning.  The
inner sync requires every work unit to hold its own concurrent OS
thread for the full duration of the dispatch; ParallelizeArray's
pool/TBB backends cannot guarantee this when work-unit count exceeds
free worker count.  std::thread guarantees one OS thread per work
unit so the neighbor-only sync protocol (left untouched) completes
on every invocation.

Other parallel sections in Iterate() (ThreadedCalculateChange,
ThreadedLoadBalance1/2, post-process) do not call
SignalNeighborsAndWait and stay on ParallelizeArray.

The robustness test added in the previous PR drops from a 100%
per-invocation deadlock rate to 0/20 with this change.  The baseline
itkParallelSparseFieldLevelSetImageFilterTest image-compare against
the on-disk MD5-pinned baseline passes bit-identically.

Tracks the test PR ("ENH: Add robustness test exposing
ParallelSparseFieldLevelSet deadlock") and depends on it being merged
first; the test currently fails on master and turns green with this
commit applied.
@hjmjohnson hjmjohnson force-pushed the fix/pslsif-deadlock-explicit-threads branch from d395122 to 7e2c489 Compare May 23, 2026 23:08
@hjmjohnson hjmjohnson requested review from dzenanz and pyushkevich May 24, 2026 15:01
@hjmjohnson
Copy link
Copy Markdown
Member Author

The deadlock this PR is fixing traces directly to Dženan's 2018–2019 refactor: replacing the global memory barrier with neighbor-only SignalNeighborsAndWait condition variables dispatched via ParallelizeArray is exactly the design that can't be satisfied under TBB worker-shortage (N CV-blocked tasks pin M<N workers). The original Awate algorithm and the protocol itself are sound; the dispatcher assumption introduced during the barrier-removal modernization is what breaks.

Copy link
Copy Markdown
Member

@dzenanz dzenanz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good. I don't have enough time to review the test.

@blowekamp
Copy link
Copy Markdown
Member

My initial instinct that is a "WaitForThreads" blocker in a ParallelizeArray call is a violation of independence of the element computation. I presume that such an algorithm could easily be converted into two passes over the array.

@dzenanz I think the documentation for this method in MultiThreaderBase should be updated to clarify the requirements of the calling function.

@dzenanz
Copy link
Copy Markdown
Member

dzenanz commented May 26, 2026

documentation for this method in MultiThreaderBase should be updated

You could propose a PR with the clarifications you would like to be there 😄

@blowekamp
Copy link
Copy Markdown
Member

I spent some time further looking at the algorithm. I am not convinced that other deadlocks do not exist with other multi-threaders with different number of threads and work units due to the blocking for neighbors.

I believe the easiest change to make to preserve this original algorithm which assumes a fixed number of threads and work units is to hard code usage of the PlatforMultiThreader. Similar to what is done here:

this->SetMultiThreader(PlatformMultiThreader::New());

Ideally another algorithm would be implemented which does not use blocking and waiting.

hjmjohnson added a commit that referenced this pull request May 27, 2026
Iterate()'s ThreadedApplyUpdate dispatch blocks in SignalNeighborsAndWait
until neighbor work units arrive, so every work unit needs its own
concurrent OS thread. Pool/TBB ParallelizeArray queues N tasks onto a
bounded worker set; CV-blocked tasks pin the workers so queued units
never start, deadlocking.

Force PlatformMultiThreader with DynamicMultiThreadingOff in the
constructor (as SLICImageFilter does) so every work unit gets a dedicated
OS thread. The neighbor-only sync protocol is unchanged.

Supersedes #6321, which spawned explicit std::threads for only one
ParallelizeArray call; forcing the threader preserves the filter's
fixed thread==work-unit assumption across all parallel sections.

Co-Authored-By: Bradley Lowekamp <blowekamp@mail.nih.gov>
hjmjohnson added a commit that referenced this pull request May 27, 2026
…rm-multithreader

BUG: Fix ParallelSparseFieldLevelSet deadlock with PlatformMultiThreader (supersedes #6321)
@hjmjohnson
Copy link
Copy Markdown
Member Author

Superceeded by: #6337

@hjmjohnson hjmjohnson closed this May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Segmentation Issues affecting the Segmentation module type:Bug Inconsistencies or issues which will cause an incorrect result under some or all circumstances type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots type:Testing Ensure that the purpose of a class is met/the results on a wide set of test cases are correct

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants