BUG: Fix ParallelSparseFieldLevelSet deadlock with explicit std::threads by hjmjohnson · Pull Request #6321 · InsightSoftwareConsortium/ITK

hjmjohnson · 2026-05-21T16:56:00Z

Fix for the ParallelSparseFieldLevelSetImageFilter dispatch-starvation deadlock demonstrated by #6320. Tracks #6320; depends on it merging first (or rebase against main after #6320 lands so this PR's diff becomes just the .hxx change).

Iterate() dispatches its ThreadedApplyUpdate body via mt->ParallelizeArray(). That body invokes SignalNeighborsAndWait() repeatedly, blocking on per-thread std::condition_variables — the neighbor-only sync protocol therefore requires every work unit to hold its own concurrent OS thread for the full duration of the dispatch. Pool/TBB-backed ParallelizeArray cannot guarantee this: N tasks are queued onto an M-worker pool where M may be < N, and CV-blocked tasks pin all workers so queued tasks never start. Deadlock.

This PR replaces only the one ParallelizeArray call that contains inner SignalNeighborsAndWait with explicit std::thread spawning: work unit 0 runs on the calling thread, N − 1 std::threads are spawned for the remainder and joined at the end. The existing neighbor-only signal/wait protocol (SignalNeighborsAndWait/SignalNeighbor/WaitForNeighbor and the m_Semaphore/m_Lock/m_Condition data members) is left untouched — the bug was that the dispatcher couldn't satisfy the protocol's requirement, not the protocol itself.

Scope

The other ParallelizeArray invocations in Iterate() (ThreadedCalculateChange, ThreadedLoadBalance1, ThreadedLoadBalance2, the post-process pass) stay on ParallelizeArray. They do not call SignalNeighborsAndWait inside the body, so no CV-blocking happens inside the dispatch and pool/TBB scheduling works for them.

SignalNeighborsAndWait and all the supporting machinery are unchanged.

Verification

Local soak on a 16-core x86 box with TBB enabled:

Test	master (no fix)	with this PR
Baseline `itkParallelSparseFieldLevelSetImageFilterTest` (image-compare), TBB ×100	94/100	100/100
Robustness test from #6320, TBB ×30	0/30 (100% deadlock)	20/20
Robustness test, POOL ×20	20/20 (no worker shortage path)	20/20
Robustness test, PLATFORM ×20	20/20 (no worker shortage path)	20/20

The baseline image-compare against Baseline/BasicFilters/ParallelSparseFieldLevelSetImageFilterTest.mha passes bit-identically.

Why explicit std::thread rather than a global barrier

A first attempt replaced the neighbor-only signal/wait with a generation-counted global barrier (std::mutex + std::condition_variable + arrival counter). The global barrier itself was correct, but it required all N waiters to be simultaneously live — which ParallelizeArray still cannot guarantee under TBB worker shortage. The global-barrier rewrite had a worse deadlock rate (27% per run) than the original.

Explicit std::thread spawning sidesteps the dispatcher's worker-count limit entirely: each work unit gets its own OS thread, lives for the full dispatch, and the unchanged neighbor sync protocol completes naturally.

ParallelSparseFieldLevelSetImageFilter::Iterate() dispatches its per-iteration ThreadedApplyUpdate body via mt->ParallelizeArray(). The body invokes SignalNeighborsAndWait() several times, blocking on per-thread std::condition_variables until both neighbor work units have arrived at the same phase. This neighbor-only protocol assumes every work unit holds its own concurrent OS thread for the full duration of the dispatch. Pool/TBB-backed ParallelizeArray does not honor this: N work units are submitted to a shared worker pool whose size is bounded by core count, and CV-blocked tasks pin the pool so queued work units never start and the dispatch deadlocks. Reported as the intermittent timeout of itkParallelSparseFieldLevelSetImageFilterTest in CDash test 2570265561 ("Trying mf->Update()" followed by a 1000s timeout) on Windows Release with TBB. The existing test fails at ~3% per run under TBB, which is too rare to catch the bug reliably in CI. This commit adds a new test itkParallelSparseFieldLevelSetImageFilterRobustnessTest that drives the deadlock to a high per-test-invocation probability: Scenario 1 (sweep x repeat): cycle 1/2/4/8/11/16/32 work-unit counts for 20 outer iterations, totalling 140 short filter runs. Per-run pool-starvation deadlock probability is low but amplification gives a ~95% per-test-invocation hit rate. Scenario 2 (determinism): three repeated runs at workUnits=11 must produce bit-identical output. Guards against future barrier rewrites that introduce numerical non-determinism. Scenario 3 (concurrent multi-pipeline): eight pipelines run simultaneously via std::async (88 work units competing for a core-count-bounded worker pool). Cross-pipeline contention directly probes the dispatch-starvation deadlock path. Local measurements on a 16-core x86 box with TBB: - Upstream code, 30 invocations: 30 timeouts (100% deadlock). - Upstream code, POOL/PLATFORM: pass (no worker shortage possible). - Test runtime with a working synchronization: ~5s wallclock. - CTest TIMEOUT set to 30 (6x leeway over expected runtime). This commit intentionally introduces a failing test on master. The proposed fix lands in a follow-up PR.

Replace the one ParallelizeArray call in ParallelSparseFieldLevelSetImageFilter::Iterate() that contains inner SignalNeighborsAndWait calls with explicit std::thread spawning. The inner sync requires every work unit to hold its own concurrent OS thread for the full duration of the dispatch; ParallelizeArray's pool/TBB backends cannot guarantee this when work-unit count exceeds free worker count. std::thread guarantees one OS thread per work unit so the neighbor-only sync protocol (left untouched) completes on every invocation. Other parallel sections in Iterate() (ThreadedCalculateChange, ThreadedLoadBalance1/2, post-process) do not call SignalNeighborsAndWait and stay on ParallelizeArray. The robustness test added in the previous PR drops from a 100% per-invocation deadlock rate to 0/20 with this change. The baseline itkParallelSparseFieldLevelSetImageFilterTest image-compare against the on-disk MD5-pinned baseline passes bit-identically. Tracks the test PR ("ENH: Add robustness test exposing ParallelSparseFieldLevelSet deadlock") and depends on it being merged first; the test currently fails on master and turns green with this commit applied.

hjmjohnson · 2026-05-24T15:02:51Z

The deadlock this PR is fixing traces directly to Dženan's 2018–2019 refactor: replacing the global memory barrier with neighbor-only SignalNeighborsAndWait condition variables dispatched via ParallelizeArray is exactly the design that can't be satisfied under TBB worker-shortage (N CV-blocked tasks pin M<N workers). The original Awate algorithm and the protocol itself are sound; the dispatcher assumption introduced during the barrier-removal modernization is what breaks.

dzenanz

The fix looks good. I don't have enough time to review the test.

blowekamp · 2026-05-26T12:55:49Z

My initial instinct that is a "WaitForThreads" blocker in a ParallelizeArray call is a violation of independence of the element computation. I presume that such an algorithm could easily be converted into two passes over the array.

@dzenanz I think the documentation for this method in MultiThreaderBase should be updated to clarify the requirements of the calling function.

dzenanz · 2026-05-26T13:14:43Z

documentation for this method in MultiThreaderBase should be updated

You could propose a PR with the clarifications you would like to be there 😄

blowekamp · 2026-05-26T14:42:31Z

I spent some time further looking at the algorithm. I am not convinced that other deadlocks do not exist with other multi-threaders with different number of threads and work units due to the blocking for neighbors.

I believe the easiest change to make to preserve this original algorithm which assumes a fixed number of threads and work units is to hard code usage of the PlatforMultiThreader. Similar to what is done here:

ITK/Modules/Segmentation/SuperPixel/include/itkSLICImageFilter.hxx

Line 49 in 7d01b0e

this->SetMultiThreader(PlatformMultiThreader::New());

Ideally another algorithm would be implemented which does not use blocking and waiting.

Iterate()'s ThreadedApplyUpdate dispatch blocks in SignalNeighborsAndWait until neighbor work units arrive, so every work unit needs its own concurrent OS thread. Pool/TBB ParallelizeArray queues N tasks onto a bounded worker set; CV-blocked tasks pin the workers so queued units never start, deadlocking. Force PlatformMultiThreader with DynamicMultiThreadingOff in the constructor (as SLICImageFilter does) so every work unit gets a dedicated OS thread. The neighbor-only sync protocol is unchanged. Supersedes #6321, which spawned explicit std::threads for only one ParallelizeArray call; forcing the threader preserves the filter's fixed thread==work-unit assumption across all parallel sections. Co-Authored-By: Bradley Lowekamp <blowekamp@mail.nih.gov>

…rm-multithreader BUG: Fix ParallelSparseFieldLevelSet deadlock with PlatformMultiThreader (supersedes #6321)

hjmjohnson · 2026-05-27T01:35:16Z

Superceeded by: #6337

hjmjohnson marked this pull request as ready for review May 21, 2026 22:35

This comment was marked as resolved.

Sign in to view

greptile-apps Bot reviewed May 21, 2026

View reviewed changes

hjmjohnson force-pushed the fix/pslsif-deadlock-explicit-threads branch 3 times, most recently from ee2083a to d395122 Compare May 23, 2026 18:57

hjmjohnson added 2 commits May 23, 2026 18:05

hjmjohnson force-pushed the fix/pslsif-deadlock-explicit-threads branch from d395122 to 7e2c489 Compare May 23, 2026 23:08

hjmjohnson requested review from dzenanz and pyushkevich May 24, 2026 15:01

hjmjohnson mentioned this pull request May 24, 2026

ENH: Add consistency (should fail) to force test failure in ParallelSparseFieldLevelSet deadlock #6320

Merged

dzenanz reviewed May 25, 2026

View reviewed changes

hjmjohnson mentioned this pull request May 26, 2026

BUG: Fix ParallelSparseFieldLevelSet deadlock with PlatformMultiThreader (supersedes #6321) #6337

Merged

hjmjohnson added a commit that referenced this pull request May 27, 2026

Merge pull request #6337 from hjmjohnson/supersede-6321-pslsif-platfo…

fc15a07

…rm-multithreader BUG: Fix ParallelSparseFieldLevelSet deadlock with PlatformMultiThreader (supersedes #6321)

hjmjohnson closed this May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

BUG: Fix ParallelSparseFieldLevelSet deadlock with explicit std::threads#6321

BUG: Fix ParallelSparseFieldLevelSet deadlock with explicit std::threads#6321
hjmjohnson wants to merge 2 commits into
InsightSoftwareConsortium:mainfrom
hjmjohnson:fix/pslsif-deadlock-explicit-threads

hjmjohnson commented May 21, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Uh oh!

hjmjohnson commented May 24, 2026

Uh oh!

dzenanz left a comment

Uh oh!

blowekamp commented May 26, 2026

Uh oh!

dzenanz commented May 26, 2026

Uh oh!

blowekamp commented May 26, 2026

Uh oh!

hjmjohnson commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

hjmjohnson commented May 21, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Uh oh!

hjmjohnson commented May 24, 2026

Uh oh!

dzenanz left a comment

Choose a reason for hiding this comment

Uh oh!

blowekamp commented May 26, 2026

Uh oh!

dzenanz commented May 26, 2026

Uh oh!

blowekamp commented May 26, 2026

Uh oh!

hjmjohnson commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants