fix(concurrency): fix ThreadPool shutdown race causing TSan hang #316

Merged
mvillmow merged 1 commit into main from 145-auto-impl on Apr 23, 2026

@mvillmow
Collaborator

Summary

  • Fix race condition in ThreadPool::shutdown() where shutdown_requested_ was stored outside queue_mutex_, allowing workers to miss the notify_all() and hang indefinitely under TSan
  • Re-enable ThreadPoolTest.NoWorkAfterShutdown (was DISABLED_NoWorkAfterShutdown)
  • Fix spdlog linkage visibility (PRIVATE → PUBLIC) so public headers propagate include paths to consumers
  • Add tsan Conan profile and CMake preset (cmake --preset tsan, just test-tsan)

Root Cause

shutdown() called shutdown_requested_.store(true) outside queue_mutex_, then condition_.notify_all(). Under TSan's timing overhead a worker could:

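  1. Finish its current work item and re-acquire queue_mutex_
  2. Evaluate the condition predicate and see shutdown_requested_ == false, because shutdown() has not stored the flag yet
  3. Be overtaken by shutdown(), which holds no lock and can therefore store the flag and call notify_all() before the worker has started waiting
  4. Block in wait() with no further notification coming, and hang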

The fix acquires queue_mutex_ before setting the flag. The C++ condition variable contract requires the predicate-modifying store to be serialised under the same mutex that the waiting thread holds during predicate evaluation — this is the standard missed-wakeup prevention pattern.
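The pattern described above can be sketched as follows. This is a minimal illustration, not the project's ThreadPool: only the member names shutdown_requested_, queue_mutex_ and condition_ come from the description, while MiniPool, submit(), worker_loop() and run_demo() are hypothetical. The sketch also assumes that queued tasks are drained on shutdown and that notify_all() is called after the lock is released; the PR does not state either detail.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the pool discussed in this PR.
class MiniPool {
public:
    explicit MiniPool(std::size_t workers) {
        for (std::size_t i = 0; i < workers; ++i) {
            threads_.emplace_back([this] { worker_loop(); });
        }
    }

    ~MiniPool() { shutdown(); }

    // Returns false once shutdown has been requested.
    bool submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(queue_mutex_);
            if (shutdown_requested_.load()) return false;
            tasks_.push(std::move(task));
        }
        condition_.notify_one();
        return true;
    }

    void shutdown() {
        {
            // Store the flag while holding the mutex that workers hold
            // during predicate evaluation, so the store cannot land between
            // a worker's predicate check and its block inside wait().
            std::lock_guard<std::mutex> lock(queue_mutex_);
            shutdown_requested_.store(true);
        }
        condition_.notify_all();
        for (auto& t : threads_) {
            if (t.joinable()) t.join();
        }
    }

private:
    void worker_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(queue_mutex_);
                condition_.wait(lock, [this] {
                    return shutdown_requested_.load() || !tasks_.empty();
                });
                if (tasks_.empty()) return;  // shutdown requested, queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run the task without holding the lock
        }
    }

    std::mutex queue_mutex_;
    std::condition_variable condition_;
    std::queue<std::function<void()>> tasks_;
    std::atomic<bool> shutdown_requested_{false};
    std::vector<std::thread> threads_;
};

// Submits `tasks` increments, shuts the pool down, and returns how many ran.
int run_demo(int tasks) {
    std::atomic<int> counter{0};
    MiniPool pool(4);
    for (int i = 0; i < tasks; ++i) {
        pool.submit([&counter] { ++counter; });
    }
    pool.shutdown();
    // Work submitted after shutdown is rejected in this sketch.
    bool accepted = pool.submit([&counter] { ++counter; });
    assert(!accepted);
    (void)accepted;
    return counter.load();
}
```

Calling run_demo(100) from main() should return 100, since this sketch drains the queue before the workers exit. Keeping the flag atomic is optional once every access happens under queue_mutex_; it is kept here only to mirror the store/load wording in the description.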

Test plan

  • ThreadPoolTest.NoWorkAfterShutdown re-enabled and passes in debug build
  • All 11 ThreadPoolTest.* tests pass
  • concurrency_unit_tests builds without errors (spdlog linkage fix)
  • Run under TSan: just test-tsan (blocked by upstream concurrentqueue incompatibility with -fsanitize=thread — pre-existing issue, not introduced here; ThreadPool itself has no concurrentqueue dependency)

Closes #145

🤖 Generated with Claude Code

The root cause: shutdown() stored to shutdown_requested_ outside the
queue_mutex_ lock, then called notify_all(). Under TSan's timing overhead
a worker thread could finish work, re-acquire the lock, evaluate the
condition predicate while shutdown_requested_ was still false (store not
yet observed), go back to sleep, then miss the notify_all() — causing an
indefinite hang.

Fix: acquire queue_mutex_ before storing shutdown_requested_ = true so
that the store and the condition variable notification are serialised
under the same lock. Workers evaluating the predicate must hold the same
lock, guaranteeing they either see the flag before sleeping or are woken
by notify_all() — the classic condition-variable happens-before
requirement.
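
To make the window concrete, the waiter side can be written with the predicate loop expanded by hand, which is what the predicate overload of wait() does. This is an illustrative sketch only: Gate, wait_for_shutdown(), request_shutdown() and run_rounds() are hypothetical names, and only the three member names are taken from the text above.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical reduction of the worker/shutdown handshake to a single flag.
struct Gate {
    std::mutex queue_mutex_;
    std::condition_variable condition_;
    std::atomic<bool> shutdown_requested_{false};

    // Hand-expanded form of condition_.wait(lock, predicate).
    void wait_for_shutdown() {
        std::unique_lock<std::mutex> lock(queue_mutex_);
        while (!shutdown_requested_.load()) {
            // (A) The predicate was false and the lock is still held.
            // A store made WITHOUT the lock could land here, followed by
            // notify_all(), before this thread has started waiting.
            condition_.wait(lock);  // (B) atomically unlocks and blocks
        }
    }

    void request_shutdown() {
        {
            // Taking the lock means this store cannot run while a waiter
            // sits between (A) and (B).
            std::lock_guard<std::mutex> lock(queue_mutex_);
            shutdown_requested_.store(true);
        }
        condition_.notify_all();
    }
};

// Repeats the handshake and returns the number of rounds that completed.
int run_rounds(int rounds) {
    int completed = 0;
    for (int i = 0; i < rounds; ++i) {
        Gate gate;
        std::thread waiter([&gate] { gate.wait_for_shutdown(); });
        gate.request_shutdown();
        waiter.join();  // would hang here if the wakeup were missed
        assert(gate.shutdown_requested_.load());
        ++completed;
    }
    return completed;
}
```

With the store under the lock, the waiter either sees the flag before it reaches (B) or is already blocked when notify_all() fires, so every round is expected to finish.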

Additional changes:
- Re-enable ThreadPoolTest.NoWorkAfterShutdown (was DISABLED_)
- Fix spdlog linkage: change PRIVATE -> PUBLIC on keystone_concurrency so
  that public headers including spdlog/* propagate include paths to
  consumers (fixes concurrency_unit_tests compilation)
- Add tsan Conan profile and CMake tsan preset (cmake --preset tsan)
- Add deps-tsan, build-tsan, test-tsan justfile recipes; deps-tsan strips
  the Conan-generated CMakePresets.json from the tsan output folder to
  avoid the duplicate conan-debug preset collision

Closes #145

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mvillmow mvillmow enabled auto-merge (rebase) April 23, 2026 14:34
@mvillmow mvillmow merged commit 906aa7c into main Apr 23, 2026
15 of 17 checks passed
@mvillmow mvillmow deleted the 145-auto-impl branch April 23, 2026 14:53
@github-actions

CI Summary

| Check | Status |
|---|---|
| Code Quality | ✅ success |
| Sanitizers (ASan, UBSan, TSan, LSan, MSan) | ✅ success |
| Benchmarks | ✅ success |
| Coverage | ✅ success |

@github-actions

Security Scan Results

  • ⚠️ Secret Scanning: No results available
  • ✅ SAST: Completed (check Security tab for details)
  • ✅ Dependency Scanning: Completed
  • ✅ C++ Static Analysis: Completed
  • ✅ Docker Image Scanning: 0 high, 21 medium vulnerabilities (acceptable)

Recommendations

  • Review findings in the GitHub Security tab
  • Check artifact uploads for detailed reports
  • Address critical Docker vulnerabilities immediately

Workflow: Security Scanning

Development

Successfully merging this pull request may close these issues.

ThreadPoolTest.NoWorkAfterShutdown hangs indefinitely under TSan
