fix: eliminate data races in pwrite/SPSC path (TSan verified)#683
Closed
KimBioInfoStudio wants to merge 3 commits intoOpenGene:masterfrom
Closed
fix: eliminate data races in pwrite/SPSC path (TSan verified)#683KimBioInfoStudio wants to merge 3 commits intoOpenGene:masterfrom
KimBioInfoStudio wants to merge 3 commits intoOpenGene:masterfrom
Conversation
added 3 commits
April 13, 2026 10:29
mNextSeq was a plain size_t array written by worker threads in inputPwrite() and read by setInputCompletedPwrite() with no synchronization -- a C++ data race (undefined behaviour). A stale read could produce a wrong lastSeq value, causing ftruncate() to silently truncate the output file at the wrong offset and drop the final gz member(s). Fix: change mNextSeq to std::atomic<size_t>[]. - Worker threads write with memory_order_release after each pack, establishing a happens-before edge for the completion reader. - setInputCompletedPwrite() opens with an acquire fence before reading with memory_order_relaxed, ensuring all prior worker writes are visible before the ftruncate() call.
The `head` pointer in SingleProducerSingleConsumerList was a plain
(non-atomic) pointer despite being written by the producer thread
(produce(), first-item branch) and read concurrently by the consumer
thread (canBeConsumed(), consume()). ThreadSanitizer reported 15 data
races at singleproducersingleconsumerlist.h:100.
Fixes applied:
- `head` declared as `std::atomic<LockFreeListItem<T>*>` (tail stays
non-atomic — producer-private after first item is published)
- Constructor: `head.store(NULL, relaxed)`
- produce() first-item branch:
set tail = item first (producer-private write),
then `head.store(item, release)` to publish atomically to consumer
then `item->nextItemReady.store(true, release)` to signal readiness
- canBeConsumed():
`head.load(acquire)` for NULL check (syncs with produce release),
`head.load(relaxed)` for nextItemReady dereference (covered by
the preceding acquire)
- consume():
`head.load(acquire)` to read current head,
`head.store(h->nextItem, release)` to advance — establishes
happens-before with next canBeConsumed() acquire on head
Also fixes the else-branch nextItemReady assignment to use
`memory_order_release` (was implicit seq_cst, which does NOT prevent
compiler reordering of the preceding `tail->nextItem = item` write).
ThreadSanitizer reported data races in ReadPool and SPSC when multiple
worker threads called ReadPool::input() concurrently:
readpool.cpp:23 — mIsFull read vs. updateFullStatus() write
readpool.cpp:27 — mProduced++ (non-atomic RMW) by multiple threads
readpool.cpp:53 — mIsFull write vs. concurrent reads
spsc.h:90 — size(): produced (producer-written) vs. consumed
(consumer-written) read without synchronization
Fixes in readpool.h:
- mIsFull : bool → std::atomic<bool>
- mProduced : size_t → std::atomic<size_t>
(atomic::operator++ and atomic::operator= are sufficient;
no changes to readpool.cpp required)
Fixes in singleproducersingleconsumerlist.h:
- produced, consumed : unsigned long → std::atomic<unsigned long>
- size(): load both with memory_order_relaxed (approximate count used
only as a soft back-pressure threshold)
- produce(): produced.fetch_add(1, relaxed)
- consume(): consumed.fetch_add(1, relaxed) with local snapshot for
the (consumed & 0xFFF) recycle check
- makeItem(): produced.load(relaxed) snapshot before >> and & ops
- recycle(): consumed.load(relaxed) before >> op
After all four commits (mNextSeq, SPSC head, ReadPool/SPSC atomics),
ThreadSanitizer reports zero data races on 5k-read PE mode 8-thread
workload.
Member
Author
|
Closing in favor of #684 which includes all 3 fixes from this PR plus the C++23 threading refactor with Highway futex optimization (wall -15%, sys -82% vs master). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix three data races detected by ThreadSanitizer (TSan) in the multi-threaded write and read-pool path.
Changes
cf64a66fix: eliminate data race onmNextSeqin pwrite path — was read/written across threads without synchronizationb32ea3afix: make SPSC queuemHeadpointer atomic — producer/consumer race on head index6316345fix: makeReadPoolcounters (mProduced/mConsumed) atomicTSan Methodology
Build environment
No system-level changes needed. A self-contained GCC toolchain was installed via
micromambainto a user-writable prefix, with all fastp dependencies (libdeflate,isa-l,libhwy):TSan build
Because fastp's
MakefileexpandsCXXFLAGSfrom the environment, the instrumented binary is produced by:TSan linkage verified with:
TSan run
A small FASTQ dataset (~10k reads) was used to keep runtime feasible (TSan adds 5–10× overhead). All races were collected in a single run:
TSAN_OPTIONS="halt_on_error=0:log_path=/tmp/tsan_out:history_size=5" ./fastp -i test.fq -o out.fq.gz --thread 4halt_on_error=0ensures all races are collected rather than stopping at the first.history_size=5produces more accurate stack traces.Iterative fix cycle
TSan races cascade — fixing one race exposes deeper ones hidden beneath it:
Result
Zero races across all three races after the fix set.
Root causes
mNextSeqWriterThread::outputTaskstd::atomic<int>mHeadSPSCQueuestd::atomic<size_t>mProduced/mConsumedReadPoolsize()during concurrent produce/consumestd::atomic<size_t>Performance Benchmark
Tested on 1M paired-end reads (150bp, simulated), 4 threads, 3 runs each (median reported). Both binaries built from the same GCC 15 toolchain with identical flags.
No performance regression. The atomic operations on
mNextSeq,mHead, andmProduced/mConsumedare on cold paths (once per read-chunk, not per read) — their overhead is negligible compared to compression and I/O.Output correctness
MD5 of decompressed output verified byte-for-byte identical between master and this PR: