Skip to content

fix(tsan): clean TSan gtest suite; fix races and RefCountGuard activation window#554

Merged
jbachorik merged 8 commits into
mainfrom
jb/sanitizers
May 28, 2026
Merged

fix(tsan): clean TSan gtest suite; fix races and RefCountGuard activation window#554
jbachorik merged 8 commits into
mainfrom
jb/sanitizers

Conversation

@jbachorik
Copy link
Copy Markdown
Collaborator

@jbachorik jbachorik commented May 28, 2026

What does this PR do?:

Makes the TSan gtest suite pass cleanly on aarch64 (Docker Desktop / linuxkit kernel with 48-bit VMA), fixes two real data races plus a correctness bug in RefCountGuard that TSan exposed, and closes two ASan-reported issues (bounds-annotation gap and a test-fixture memory leak).

Motivation:

TSan gtests were crashing before any test ran due to a VMA mismatch between GCC 11's libtsan (39-bit VMA only) and Docker Desktop's linuxkit kernel (48-bit VMA). Once the environment was fixed, TSan found two real races and revealed the RefCountGuard activation-window gap. ASan found a missing no_sanitize(bounds) on non-x86 architectures and a ProfiledThread leak in the T-02 test fixture. All four sanitizer jobs are now blocking — pipeline fails on any finding.

Additional Notes:

Races fixed (found by TSan):

  • callTraceHashTable.cpp: plain read of _table in put() raced with the atomic CAS in the table-expansion path. Fixed with __atomic_load_n(..., __ATOMIC_ACQUIRE) to pair with the ACQ_REL CAS.

  • stringDictionary.h: plain read of row->next before the overflow-chain CAS raced with another thread's CAS. Fixed with __atomic_load_n(..., __ATOMIC_RELAXED) (the ACQUIRE load at the CAS site already provides the required happens-before for the newly-allocated SBTable).

  • waitForRefCountToClear / waitForAllRefCountsToClear skipped slots with count==0 without checking active_ptr. The pointer-first activation protocol stores active_ptr (RELEASE) before the count increment (RELEASE), so a slot can be live (active_ptr set) while count is still zero. The scanner now checks active_ptr on zero-count slots to close this window. refCountGuard_ut.cpp exercises this scenario.

ASan fixes:

  • no_sanitize("bounds") extended to all architectures (was x86-only); the annotation guards an intentional out-of-bounds read used to detect array-vs-pointer calling conventions in the JVM.

  • T-02 ProfiledThread leak: clearCurrentThreadTLS() now returns the detached pointer instead of void, and a companion deleteForTest() helper (destructor is private) lets the test complete the simulated teardown. Previously, the object leaked because release() found TLS already null and did nothing, and the pthread key destructor was never called for a null value.

TSan infrastructure (aarch64 / Docker Desktop):

  • run-docker-tests.sh: installs clang-17 + libclang-rt-17-dev on aarch64 glibc images (GCC 11's TSan doesn't support 48-bit VMA); adds --privileged + sysctl vm.mmap_rnd_bits=28 so TSan can map its shadow; passes -Pnative.forceCompiler=clang++-17.
  • PlatformUtils.kt: locateLibtsan() now prefers clang's own libclang_rt.tsan-<arch>.so over GCC's libtsan, matching the existing locateLibasan() pattern.
  • ConfigurationPresets.kt: uses dynamic lib name in TSan linker args; adds JVM-safe TSAN_OPTIONS (handle_segv/sigbus=0, use_sigaltstack=0, io_sync=0); sets GTEST_DEATH_TEST_STYLE=threadsafe (fork is unsafe under TSan).
  • GtestTaskBuilder.kt: strips -ltsan/-lclang_rt.tsan* from gtest link so clang statically embeds its own TSan runtime, preventing GCC/clang runtime collision.
  • ProfilerTestPlugin.kt: skips Java integration tests under TSan — the profiler's SIGPROF handler fires at 1ms inside TSan-instrumented code, causing re-entrance crashes. TSan coverage is provided by the C++ gtest suite (consistent with CI).
  • tsan.supp: suppresses JVM-internal libnio.so/libjava.so FD-management races.

CI:

  • All four sanitizer jobs (ASan amd64/arm64, TSan amd64/arm64) are now blocking — allow_failure: true removed. Pipeline fails on any leak or data race.

How to test the change?:

./utils/run-docker-tests.sh --config=tsan --gtest

Expected: BUILD SUCCESSFUL, all 31 gtestTsan_* tasks pass, testTsan SKIPPED.

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: [JIRA-XXXX]

jbachorik and others added 2 commits May 28, 2026 13:32
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…untGuard

Races fixed by TSan:
- callTraceHashTable.cpp: non-atomic read of _table in put() vs atomic CAS
  expansion; add __atomic_load_n ACQUIRE to pair with the ACQ_REL CAS.
- stringDictionary.h: non-atomic read of row->next before overflow CAS;
  use __atomic_load_n RELAXED (the ACQUIRE on line 360 provides the
  happens-before for the newly-allocated SBTable).

RefCountGuard activation-window fix:
- waitForRefCountToClear / waitForAllRefCountsToClear skipped slots with
  count==0 without checking active_ptr. The pointer-first protocol means
  active_ptr is stored (RELEASE) before count++ (RELEASE), so count==0
  with active_ptr set is a live activation window. Add the active_ptr
  ACQUIRE check to close it.
- Add refCountGuard_ut.cpp to exercise the activation window scenario.
- Update header docs to describe the corrected scanner protocol.

TSan infrastructure (aarch64 / Docker Desktop):
- run-docker-tests.sh: install clang-17 + libclang-rt-17-dev on aarch64
  glibc images; add --privileged + sysctl vm.mmap_rnd_bits=28 for TSan;
  pass -Pnative.forceCompiler=clang++-17 (GCC 11's libtsan has 39-bit
  VMA only; Docker Desktop linuxkit uses 48-bit VMA).
- PlatformUtils.kt: locateLibtsan() prefers clang's own
  libclang_rt.tsan-<arch>.so over GCC's libtsan.
- ConfigurationPresets.kt: use dynamic lib name in TSan linker args
  (matches configureAsan pattern); add JVM-safe TSAN_OPTIONS
  (handle_segv/sigbus=0, use_sigaltstack=0, io_sync=0); set
  GTEST_DEATH_TEST_STYLE=threadsafe (fork unsupported under TSan).
- GtestTaskBuilder.kt: strip -ltsan/-lclang_rt.tsan* from gtest link so
  clang statically embeds its own TSan runtime (prevents GCC/clang
  runtime collision).
- ProfilerTestPlugin.kt: skip Java integration tests under TSan (SIGPROF
  at 1ms fires in TSan-instrumented code; re-entrance crashes the JVM).
  TSan coverage is provided by the C++ gtest suite.
- tsan.supp: suppress JVM-internal libnio.so/libjava.so FD races.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@jbachorik jbachorik added the AI label May 28, 2026
@jbachorik jbachorik marked this pull request as ready for review May 28, 2026 16:35
@jbachorik jbachorik requested a review from a team as a code owner May 28, 2026 16:35
@jbachorik jbachorik requested a review from Copilot May 28, 2026 16:35
@jbachorik jbachorik marked this pull request as draft May 28, 2026 16:35
@jbachorik
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes TSan-detected races and updates the TSan test/build path, especially for aarch64 Docker Desktop environments with 48-bit VMA support needs.

Changes:

  • Fixes atomic access races in CallTraceHashTable and StringDictionary.
  • Closes the RefCountGuard activation-window gap and adds regression coverage.
  • Updates TSan Docker/build/test configuration to use compatible clang runtimes and skip incompatible JVM integration tests.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
utils/run-docker-tests.sh Adds aarch64 clang-17 setup and privileged/sysctl handling for TSan.
gradle/sanitizers/tsan.supp Adds JVM library race suppressions.
ddprof-lib/src/test/cpp/refCountGuard_ut.cpp Adds activation-window regression tests.
ddprof-lib/src/main/cpp/stringDictionary.h Uses an atomic relaxed load for overflow-chain hinting.
ddprof-lib/src/main/cpp/refCountGuard.h Updates protocol documentation for active pointer scanning.
ddprof-lib/src/main/cpp/refCountGuard.cpp Checks active_ptr during zero-count wait scans.
ddprof-lib/src/main/cpp/callTraceHashTable.cpp Uses acquire load for _table in put().
build-logic/conventions/src/main/kotlin/com/datadoghq/profiler/ProfilerTestPlugin.kt Skips JVM integration tests under TSan.
build-logic/conventions/src/main/kotlin/com/datadoghq/native/util/PlatformUtils.kt Prefers clang TSan runtime when using clang.
build-logic/conventions/src/main/kotlin/com/datadoghq/native/gtest/GtestTaskBuilder.kt Strips explicit TSan runtime link flags for gtest executables.
build-logic/conventions/src/main/kotlin/com/datadoghq/native/config/ConfigurationPresets.kt Updates TSan linker args and test environment options.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ddprof-lib/src/main/cpp/stringDictionary.h Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2560589453

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…tion

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@jbachorik jbachorik marked this pull request as ready for review May 28, 2026 16:47
@jbachorik jbachorik requested a review from zhengyu123 May 28, 2026 16:49
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 21e76a33c7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@datadog-prod-us1-6

This comment has been minimized.

clearCurrentThreadTLS() now returns the detached pointer; test deletes
it via deleteForTest() after the race-window assertion. Removes
allow_failure from all four sanitizer jobs so pipeline fails on leaks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 28, 2026

CI Test Results

Run: #26598758953 | Commit: 6d5ee81 | Duration: 12m 23s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-05-28 20:13:03 UTC

jbachorik and others added 2 commits May 28, 2026 21:15
[ rc -ne 0 ] && {...} exits 1 when rc=0 (condition false), making the
pipeline exit 1 even on success. Use if/fi (exits 0 on false branch)
matching the ASan base template.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dispatch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik jbachorik merged commit ed3b012 into main May 28, 2026
103 checks passed
@jbachorik jbachorik deleted the jb/sanitizers branch May 28, 2026 20:27
@github-actions github-actions Bot added this to the 1.44.0 milestone May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants