fix: eliminate data races in pwrite/SPSC path (TSan verified) by KimBioInfoStudio · Pull Request #683 · OpenGene/fastp

KimBioInfoStudio · 2026-04-15T07:43:32Z

Summary

Fix three data races detected by ThreadSanitizer (TSan) in the multi-threaded write and read-pool path.

Changes

cf64a66 fix: eliminate data race on mNextSeq in pwrite path — was read/written across threads without synchronization
b32ea3a fix: make SPSC queue mHead pointer atomic — producer/consumer race on head index
6316345 fix: make ReadPool counters (mProduced/mConsumed) atomic

TSan Methodology

Build environment

No system-level changes needed. A self-contained GCC toolchain was installed via micromamba into a user-writable prefix, with all fastp dependencies (libdeflate, isa-l, libhwy):

micromamba create -p ~/build-env -c conda-forge     gxx_linux-64>=12 make pkg-config libdeflate isa-l libhwy

TSan build

Because fastp's Makefile expands CXXFLAGS from the environment, the instrumented binary is produced by:

export CXXFLAGS="-fsanitize=thread -fno-omit-frame-pointer"
make clean && make -j4 CXX=x86_64-conda-linux-gnu-g++

TSan linkage verified with:

ldd fastp | grep tsan   # must show libtsan.so.0

TSan run

A small FASTQ dataset (~10k reads) was used to keep runtime feasible (TSan adds 5–10× overhead). All races were collected in a single run:

TSAN_OPTIONS="halt_on_error=0:log_path=/tmp/tsan_out:history_size=5"     ./fastp -i test.fq -o out.fq.gz --thread 4

halt_on_error=0 ensures all races are collected rather than stopping at the first. history_size=5 produces more accurate stack traces.

Iterative fix cycle

TSan races cascade — fixing one race exposes deeper ones hidden beneath it:

Before fix:  15 races reported
After cf64a66 (mNextSeq atomic):  10 races
After b32ea3a (SPSC mHead atomic):  4 races
After 6316345 (ReadPool counters atomic):  0 races ✅

Result

ThreadSanitizer: reported 0 warnings

Zero races across all three races after the fix set.

Root causes

Variable	Location	Race type	Fix
`mNextSeq`	`WriterThread::outputTask`	Write-write + read-write across commit/worker threads	`std::atomic<int>`
`mHead`	`SPSCQueue`	Producer write / consumer read without synchronization	`std::atomic<size_t>`
`mProduced` / `mConsumed`	`ReadPool`	Multiple threads calling `size()` during concurrent produce/consume	`std::atomic<size_t>`

Performance Benchmark

Tested on 1M paired-end reads (150bp, simulated), 4 threads, 3 runs each (median reported). Both binaries built from the same GCC 15 toolchain with identical flags.

Mode	master	this PR	Δ
Compressed (gz → gz)	25.95s	24.52s	−5.5% ✅
Uncompressed (fq → fq)	17.01s	16.83s	−1.1% ✅

No performance regression. The atomic operations on mNextSeq, mHead, and mProduced/mConsumed are on cold paths (once per read-chunk, not per read) — their overhead is negligible compared to compression and I/O.

Output correctness

MD5 of decompressed output verified byte-for-byte identical between master and this PR:

R1: 68175a80e9e8854cee38b06ab707a59b  ✅ identical
R2: f6ae8dba3ead4e2f1d35a7e3b1ef7951  ✅ identical

mNextSeq was a plain size_t array written by worker threads in inputPwrite() and read by setInputCompletedPwrite() with no synchronization -- a C++ data race (undefined behaviour). A stale read could produce a wrong lastSeq value, causing ftruncate() to silently truncate the output file at the wrong offset and drop the final gz member(s). Fix: change mNextSeq to std::atomic<size_t>[]. - Worker threads write with memory_order_release after each pack, establishing a happens-before edge for the completion reader. - setInputCompletedPwrite() opens with an acquire fence before reading with memory_order_relaxed, ensuring all prior worker writes are visible before the ftruncate() call.

The `head` pointer in SingleProducerSingleConsumerList was a plain (non-atomic) pointer despite being written by the producer thread (produce(), first-item branch) and read concurrently by the consumer thread (canBeConsumed(), consume()). ThreadSanitizer reported 15 data races at singleproducersingleconsumerlist.h:100. Fixes applied: - `head` declared as `std::atomic<LockFreeListItem<T>*>` (tail stays non-atomic — producer-private after first item is published) - Constructor: `head.store(NULL, relaxed)` - produce() first-item branch: set tail = item first (producer-private write), then `head.store(item, release)` to publish atomically to consumer then `item->nextItemReady.store(true, release)` to signal readiness - canBeConsumed(): `head.load(acquire)` for NULL check (syncs with produce release), `head.load(relaxed)` for nextItemReady dereference (covered by the preceding acquire) - consume(): `head.load(acquire)` to read current head, `head.store(h->nextItem, release)` to advance — establishes happens-before with next canBeConsumed() acquire on head Also fixes the else-branch nextItemReady assignment to use `memory_order_release` (was implicit seq_cst, which does NOT prevent compiler reordering of the preceding `tail->nextItem = item` write).

ThreadSanitizer reported data races in ReadPool and SPSC when multiple worker threads called ReadPool::input() concurrently: readpool.cpp:23 — mIsFull read vs. updateFullStatus() write readpool.cpp:27 — mProduced++ (non-atomic RMW) by multiple threads readpool.cpp:53 — mIsFull write vs. concurrent reads spsc.h:90 — size(): produced (producer-written) vs. consumed (consumer-written) read without synchronization Fixes in readpool.h: - mIsFull : bool → std::atomic<bool> - mProduced : size_t → std::atomic<size_t> (atomic::operator++ and atomic::operator= are sufficient; no changes to readpool.cpp required) Fixes in singleproducersingleconsumerlist.h: - produced, consumed : unsigned long → std::atomic<unsigned long> - size(): load both with memory_order_relaxed (approximate count used only as a soft back-pressure threshold) - produce(): produced.fetch_add(1, relaxed) - consume(): consumed.fetch_add(1, relaxed) with local snapshot for the (consumed & 0xFFF) recycle check - makeItem(): produced.load(relaxed) snapshot before >> and & ops - recycle(): consumed.load(relaxed) before >> op After all four commits (mNextSeq, SPSC head, ReadPool/SPSC atomics), ThreadSanitizer reports zero data races on 5k-read PE mode 8-thread workload.

KimBioInfoStudio · 2026-04-17T02:43:35Z

Closing in favor of #684 which includes all 3 fixes from this PR plus the C++23 threading refactor with Highway futex optimization (wall -15%, sys -82% vs master).

Hermes added 3 commits April 13, 2026 10:29

KimBioInfoStudio closed this Apr 17, 2026

KimBioInfoStudio mentioned this pull request Apr 17, 2026

fix+perf: eliminate data races & replace all synchronization with Highway futex (zero mutex, wall -7%, sys -51%) #684

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: eliminate data races in pwrite/SPSC path (TSan verified)#683

fix: eliminate data races in pwrite/SPSC path (TSan verified)#683
KimBioInfoStudio wants to merge 3 commits intoOpenGene:masterfrom
KimBioInfoStudio:fix/race-conditions-pwrite-spsc

KimBioInfoStudio commented Apr 15, 2026 •

edited

Loading

Uh oh!

KimBioInfoStudio commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

KimBioInfoStudio commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

TSan Methodology

Build environment

TSan build

TSan run

Iterative fix cycle

Result

Root causes

Performance Benchmark

Output correctness

Uh oh!

KimBioInfoStudio commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

KimBioInfoStudio commented Apr 15, 2026 •

edited

Loading