fix(tsan): clean TSan gtest suite; fix races and RefCountGuard activation window#554
Conversation
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…untGuard Races fixed by TSan: - callTraceHashTable.cpp: non-atomic read of _table in put() vs atomic CAS expansion; add __atomic_load_n ACQUIRE to pair with the ACQ_REL CAS. - stringDictionary.h: non-atomic read of row->next before overflow CAS; use __atomic_load_n RELAXED (the ACQUIRE on line 360 provides the happens-before for the newly-allocated SBTable). RefCountGuard activation-window fix: - waitForRefCountToClear / waitForAllRefCountsToClear skipped slots with count==0 without checking active_ptr. The pointer-first protocol means active_ptr is stored (RELEASE) before count++ (RELEASE), so count==0 with active_ptr set is a live activation window. Add the active_ptr ACQUIRE check to close it. - Add refCountGuard_ut.cpp to exercise the activation window scenario. - Update header docs to describe the corrected scanner protocol. TSan infrastructure (aarch64 / Docker Desktop): - run-docker-tests.sh: install clang-17 + libclang-rt-17-dev on aarch64 glibc images; add --privileged + sysctl vm.mmap_rnd_bits=28 for TSan; pass -Pnative.forceCompiler=clang++-17 (GCC 11's libtsan has 39-bit VMA only; Docker Desktop linuxkit uses 48-bit VMA). - PlatformUtils.kt: locateLibtsan() prefers clang's own libclang_rt.tsan-<arch>.so over GCC's libtsan. - ConfigurationPresets.kt: use dynamic lib name in TSan linker args (matches configureAsan pattern); add JVM-safe TSAN_OPTIONS (handle_segv/sigbus=0, use_sigaltstack=0, io_sync=0); set GTEST_DEATH_TEST_STYLE=threadsafe (fork unsupported under TSan). - GtestTaskBuilder.kt: strip -ltsan/-lclang_rt.tsan* from gtest link so clang statically embeds its own TSan runtime (prevents GCC/clang runtime collision). - ProfilerTestPlugin.kt: skip Java integration tests under TSan (SIGPROF at 1ms fires in TSan-instrumented code; re-entrance crashes the JVM). TSan coverage is provided by the C++ gtest suite. - tsan.supp: suppress JVM-internal libnio.so/libjava.so FD races. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
Pull request overview
This PR fixes TSan-detected races and updates the TSan test/build path, especially for aarch64 Docker Desktop environments with 48-bit VMA support needs.
Changes:
- Fixes atomic access races in
CallTraceHashTableandStringDictionary. - Closes the
RefCountGuardactivation-window gap and adds regression coverage. - Updates TSan Docker/build/test configuration to use compatible clang runtimes and skip incompatible JVM integration tests.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
utils/run-docker-tests.sh |
Adds aarch64 clang-17 setup and privileged/sysctl handling for TSan. |
gradle/sanitizers/tsan.supp |
Adds JVM library race suppressions. |
ddprof-lib/src/test/cpp/refCountGuard_ut.cpp |
Adds activation-window regression tests. |
ddprof-lib/src/main/cpp/stringDictionary.h |
Uses an atomic relaxed load for overflow-chain hinting. |
ddprof-lib/src/main/cpp/refCountGuard.h |
Updates protocol documentation for active pointer scanning. |
ddprof-lib/src/main/cpp/refCountGuard.cpp |
Checks active_ptr during zero-count wait scans. |
ddprof-lib/src/main/cpp/callTraceHashTable.cpp |
Uses acquire load for _table in put(). |
build-logic/conventions/src/main/kotlin/com/datadoghq/profiler/ProfilerTestPlugin.kt |
Skips JVM integration tests under TSan. |
build-logic/conventions/src/main/kotlin/com/datadoghq/native/util/PlatformUtils.kt |
Prefers clang TSan runtime when using clang. |
build-logic/conventions/src/main/kotlin/com/datadoghq/native/gtest/GtestTaskBuilder.kt |
Strips explicit TSan runtime link flags for gtest executables. |
build-logic/conventions/src/main/kotlin/com/datadoghq/native/config/ConfigurationPresets.kt |
Updates TSan linker args and test environment options. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2560589453
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…tion Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 21e76a33c7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
This comment has been minimized.
This comment has been minimized.
clearCurrentThreadTLS() now returns the detached pointer; test deletes it via deleteForTest() after the race-window assertion. Removes allow_failure from all four sanitizer jobs so pipeline fails on leaks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CI Test ResultsRun: #26598758953 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled Summary: Total: 32 | Passed: 32 | Failed: 0 Updated: 2026-05-28 20:13:03 UTC |
[ rc -ne 0 ] && {...} exits 1 when rc=0 (condition false), making the
pipeline exit 1 even on success. Use if/fi (exits 0 on false branch)
matching the ASan base template.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dispatch Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
What does this PR do?:
Makes the TSan gtest suite pass cleanly on aarch64 (Docker Desktop / linuxkit kernel with 48-bit VMA), fixes two real data races plus a correctness bug in
RefCountGuardthat TSan exposed, and closes two ASan-reported issues (bounds-annotation gap and a test-fixture memory leak).Motivation:
TSan gtests were crashing before any test ran due to a VMA mismatch between GCC 11's
libtsan(39-bit VMA only) and Docker Desktop's linuxkit kernel (48-bit VMA). Once the environment was fixed, TSan found two real races and revealed theRefCountGuardactivation-window gap. ASan found a missingno_sanitize(bounds)on non-x86 architectures and aProfiledThreadleak in the T-02 test fixture. All four sanitizer jobs are now blocking — pipeline fails on any finding.Additional Notes:
Races fixed (found by TSan):
callTraceHashTable.cpp: plain read of_tableinput()raced with the atomic CAS in the table-expansion path. Fixed with__atomic_load_n(..., __ATOMIC_ACQUIRE)to pair with theACQ_RELCAS.stringDictionary.h: plain read ofrow->nextbefore the overflow-chain CAS raced with another thread's CAS. Fixed with__atomic_load_n(..., __ATOMIC_RELAXED)(theACQUIREload at the CAS site already provides the required happens-before for the newly-allocatedSBTable).waitForRefCountToClear/waitForAllRefCountsToClearskipped slots withcount==0without checkingactive_ptr. The pointer-first activation protocol storesactive_ptr(RELEASE) before the count increment (RELEASE), so a slot can be live (active_ptrset) whilecountis still zero. The scanner now checksactive_ptron zero-count slots to close this window.refCountGuard_ut.cppexercises this scenario.ASan fixes:
no_sanitize("bounds")extended to all architectures (was x86-only); the annotation guards an intentional out-of-bounds read used to detect array-vs-pointer calling conventions in the JVM.T-02
ProfiledThreadleak:clearCurrentThreadTLS()now returns the detached pointer instead ofvoid, and a companiondeleteForTest()helper (destructor is private) lets the test complete the simulated teardown. Previously, the object leaked becauserelease()found TLS already null and did nothing, and the pthread key destructor was never called for a null value.TSan infrastructure (aarch64 / Docker Desktop):
run-docker-tests.sh: installsclang-17+libclang-rt-17-devon aarch64 glibc images (GCC 11's TSan doesn't support 48-bit VMA); adds--privileged+sysctl vm.mmap_rnd_bits=28so TSan can map its shadow; passes-Pnative.forceCompiler=clang++-17.PlatformUtils.kt:locateLibtsan()now prefers clang's ownlibclang_rt.tsan-<arch>.soover GCC'slibtsan, matching the existinglocateLibasan()pattern.ConfigurationPresets.kt: uses dynamic lib name in TSan linker args; adds JVM-safeTSAN_OPTIONS(handle_segv/sigbus=0,use_sigaltstack=0,io_sync=0); setsGTEST_DEATH_TEST_STYLE=threadsafe(fork is unsafe under TSan).GtestTaskBuilder.kt: strips-ltsan/-lclang_rt.tsan*from gtest link so clang statically embeds its own TSan runtime, preventing GCC/clang runtime collision.ProfilerTestPlugin.kt: skips Java integration tests under TSan — the profiler's SIGPROF handler fires at 1ms inside TSan-instrumented code, causing re-entrance crashes. TSan coverage is provided by the C++ gtest suite (consistent with CI).tsan.supp: suppresses JVM-internallibnio.so/libjava.soFD-management races.CI:
allow_failure: trueremoved. Pipeline fails on any leak or data race.How to test the change?:
Expected:
BUILD SUCCESSFUL, all 31gtestTsan_*tasks pass,testTsan SKIPPED.credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance.