fix(calltrace): fix three SIGSEGV-inducing defects in clearTableOnly/collect (PROF-14889)#585

Open: jbachorik wants to merge 17 commits into main from jb/fix-process-call-traces-crash

@jbachorik jbachorik commented Jun 8, 2026

What does this PR do?:

Fixes three independent defects in CallTraceHashTable that together produce a sporadic null-deref SIGSEGV in Profiler::processCallTraces during JFR snapshot dumps under high-rate wall-clock profiling.

| # | Defect | Location | Effect |
|---|--------|----------|--------|
| A | clearTableOnly() called waitForAllRefCountsToClear() (global wait) instead of waitForRefCountToClear(this) (targeted). Under high signal rate the global wait times out while put() calls to _active are still in flight. The function then proceeds, leaving collect() on the old-active racing with a still-running put(). | callTraceHashTable.cpp:144 | Sporadic null-deref SIGSEGV in processCallTraces |
| B | collect() and clearTableOnly() read _table with a plain (non-atomic) load while put() can CAS-expand _table on another thread (ACQ_REL CAS). On ARM64 (weak memory ordering) this is a data race: the reader may see a stale or partially-initialised pointer. | callTraceHashTable.cpp:410, 175 | Missed traces; potential crash on aarch64; TSan data-race report |
| C | The _prev-chain-clearing loop in clearTableOnly() advanced via table->prev() after setPrev(nullptr) already cleared the link, so the loop exited after the first node on an expanded (multi-node) table. | callTraceHashTable.cpp:148–153 | Dangling _prev pointers left in freed memory; would corrupt collect() on memory reuse |
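
For readers unfamiliar with the loop shape behind defect C, here is a minimal sketch of the bug and its fix. The `Node` type and the `clearChainBuggy` / `clearChainFixed` helpers are hypothetical stand-ins written for illustration; they are not the code in callTraceHashTable.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for one node of an expanded table chain.
struct Node {
    Node* _prev = nullptr;
    Node* prev() const { return _prev; }
    void setPrev(Node* p) { _prev = p; }
};

// Buggy shape: the link is cleared before the loop reads it to advance,
// so table->prev() is already nullptr and the loop stops after one node.
inline std::size_t clearChainBuggy(Node* head) {
    std::size_t visited = 0;
    for (Node* table = head; table != nullptr; table = table->prev()) {
        table->setPrev(nullptr);
        ++visited;
    }
    return visited;
}

// Fixed shape: save the link to a local, clear it, then advance via the local.
inline std::size_t clearChainFixed(Node* head) {
    std::size_t visited = 0;
    Node* table = head;
    while (table != nullptr) {
        Node* prev = table->prev();  // read the link before clearing it
        table->setPrev(nullptr);
        ++visited;
        table = prev;
    }
    // Postcondition: the head no longer points at an older node.
    assert(head == nullptr || head->prev() == nullptr);
    return visited;
}
```

On a three-node chain the buggy variant disconnects only the head and leaves the older nodes linked, which matches the "exited after the first node" symptom described above.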

Motivation:

PROF-14889: customer JVMs running Java 25 under heavy wall-clock profiling crash with SIGSEGV in processCallTraces. The crash is probabilistic and hard to reproduce because it requires a specific timing: a dump() call must arrive while a high-rate signal storm is filling CallTraceStorage, triggering the global-wait timeout in defect A and exposing the use-after-free window.

Defect B is a latent correctness issue on aarch64 that TSan would flag; defect C is a latent memory-safety issue that would manifest under allocator reuse.
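
The difference between the global and the targeted wait in defect A can be illustrated with a small model. Everything below is an assumption made for illustration: the `RefSlot` layout, the slot count, and the bounded spin that stands in for a timeout are hypothetical and do not reflect how RefCountGuard is actually implemented.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical slot: records which table an in-flight put() is using.
struct RefSlot {
    std::atomic<const void*> table{nullptr};
    std::atomic<int> count{0};
};

constexpr std::size_t kSlots = 4;
constexpr int kMaxSpins = 1000;  // bounded spin standing in for a timeout

// Global wait: succeeds only if every slot is idle within the spin budget.
// Under a constant signal storm some slot is almost always busy, so this
// can time out even when the table being cleared is no longer in use.
inline bool waitForAllToClear(const RefSlot (&slots)[kSlots]) {
    for (int spin = 0; spin < kMaxSpins; ++spin) {
        bool busy = false;
        for (const RefSlot& s : slots) {
            if (s.count.load(std::memory_order_acquire) > 0) busy = true;
        }
        if (!busy) return true;
    }
    return false;  // timed out
}

// Targeted wait: only slots that reference `target` keep the caller waiting.
inline bool waitForTableToClear(const RefSlot (&slots)[kSlots], const void* target) {
    assert(target != nullptr);
    for (int spin = 0; spin < kMaxSpins; ++spin) {
        bool busy = false;
        for (const RefSlot& s : slots) {
            if (s.table.load(std::memory_order_acquire) == target &&
                s.count.load(std::memory_order_acquire) > 0) {
                busy = true;
            }
        }
        if (!busy) return true;
    }
    return false;  // timed out
}
```

In this model a standby table that no slot references passes the targeted wait immediately, while the global wait fails as long as any put() to any table is in flight.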

Additional Notes:

  • Defect A fix: RefCountGuard::waitForRefCountToClear(this) returns immediately for standby/scratch tables (they are never _active_storage) and correctly drains only in-flight puts to the table being cleared.
  • Defect B fix: all _table reads that can race with put() expansion now use __atomic_load_n(..., __ATOMIC_ACQUIRE); the reinitialisation write in clearTableOnly() uses __atomic_store_n(..., __ATOMIC_RELEASE). The decrementCounters() and putWithExistingId() plain reads remain safe because their callers guarantee no concurrent writer (lockAll()-guarded clear path or single-threaded scratch/standby contract).
  • Defect C fix: the chain-clearing loop now saves table->prev() to a local before calling setPrev(nullptr), then advances via the saved pointer.
  • The _table field comment in callTraceHashTable.h is updated to document the acquire/release protocol.
  • The existing decrementCounters() comment that referenced the removed waitForAllRefCountsToClear() is updated to reflect the actual safety invariant.
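
The acquire/release protocol from the defect B fix can be sketched as follows, using the GCC/Clang `__atomic` builtins named above. The `Table` and `Holder` types and the method names are hypothetical; only the builtins and memory orders come from the description, and the real field layout may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one table node.
struct Table {
    std::uint32_t capacity;
    Table* prev;
};

struct Holder {
    Table* _table = nullptr;

    // Reader side (collect/clear paths): the acquire load pairs with the
    // release half of publish() and tryExpand(), so a reader that sees the
    // new pointer also sees the initialised fields behind it.
    Table* load() { return __atomic_load_n(&_table, __ATOMIC_ACQUIRE); }

    // Reinitialisation write: release store.
    void publish(Table* t) { __atomic_store_n(&_table, t, __ATOMIC_RELEASE); }

    // Expansion (put path): install `bigger` only if _table is still `expected`.
    bool tryExpand(Table* expected, Table* bigger) {
        assert(bigger != nullptr);
        bigger->prev = expected;  // initialise before publishing
        return __atomic_compare_exchange_n(&_table, &expected, bigger,
                                           /*weak=*/false,
                                           __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
    }
};
```

A losing `tryExpand` leaves `_table` untouched, so the caller in this sketch would discard its candidate and retry against the pointer returned by `load()`.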

How to test the change?:

Two new C++ unit tests in test_callTraceStorage.cpp:

  • ClearTableOnlyDisconnectsFullChain — inserts 50 000 entries to force table expansion, then calls processTraces() twice and asserts no crash / no dangling pointer corruption (targets defect C).
  • CollectFindsAllTracesAcrossExpandedChain — inserts 50 000 entries, calls processTraces(), and asserts every inserted trace ID appears in the snapshot (targets defect B).
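
The completeness check in the second test has roughly the following shape. This is a single-threaded toy model, not the test in test_callTraceStorage.cpp: `ToyTable`, `ToyStorage`, the doubling policy, and the initial capacity are all assumptions chosen to force a multi-node chain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical table node: holds IDs and links to the older, smaller table.
struct ToyTable {
    std::vector<std::uint64_t> ids;
    std::size_t capacity;
    ToyTable* prev;
    ToyTable(std::size_t cap, ToyTable* p) : capacity(cap), prev(p) {}
};

class ToyStorage {
public:
    explicit ToyStorage(std::size_t initialCapacity) {
        assert(initialCapacity > 0);
        _owned.emplace_back(new ToyTable(initialCapacity, nullptr));
        _table = _owned.back().get();
    }

    // When the newest table is full, chain a table of twice the capacity in front.
    void put(std::uint64_t id) {
        if (_table->ids.size() == _table->capacity) {
            _owned.emplace_back(new ToyTable(_table->capacity * 2, _table));
            _table = _owned.back().get();
        }
        _table->ids.push_back(id);
    }

    std::size_t chainLength() const {
        std::size_t n = 0;
        for (const ToyTable* t = _table; t != nullptr; t = t->prev) ++n;
        return n;
    }

    // Walk the whole chain; stopping at the head would miss older entries.
    std::unordered_set<std::uint64_t> collect() const {
        std::unordered_set<std::uint64_t> out;
        for (const ToyTable* t = _table; t != nullptr; t = t->prev) {
            out.insert(t->ids.begin(), t->ids.end());
        }
        return out;
    }

private:
    std::vector<std::unique_ptr<ToyTable>> _owned;
    ToyTable* _table = nullptr;
};
```

A test over this model inserts enough IDs to expand the chain, then asserts that the chain has more than one node and that every inserted ID is present in the collected set.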

One new Java integration test ProcessCallTracesRaceTest:

  • Spawns 64 CPU-hot worker threads and 4 concurrent dump() threads for 20 seconds under wall=1ms,cpu=1ms.
  • A SIGSEGV from defect A aborts the JVM, which Gradle/JUnit treats as a non-zero exit and fails the build.

CI runs the C++ unit tests (ddprof-lib:testRelease) and the Java integration tests (ddprof-test:testRelease) on all supported platforms including aarch64.

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: PROF-14889

jbachorik added 7 commits June 3, 2026 17:57
- Fix inaccurate comments: clearTableOnly active-table path comment
  (no prior caller drain; protection is lockAll + in-function wait),
  decrementCounters precondition (targeted wait, not global), and
  RELEASE store comment expanded for aarch64 clarity
- Fix header comment: _table is accessed with ACQUIRE/RELEASE/ACQ_REL
  ordering; document exceptions for plain-load callers
- Drop redundant const_cast in collect() loop initialiser
- Use atomic ACQUIRE loads in putWithExistingId() for consistency
- Strengthen ClearTableOnlyDisconnectsFullChain: second processTraces()
  now asserts count==1 (sentinel only) for deterministic defect detection
- ProcessCallTracesRaceTest: pre-allocate one temp file per dump thread
  to avoid filesystem churn; widen awaitTermination to +30 s
@jbachorik jbachorik added the AI label Jun 8, 2026
datadog-datadog-prod-us1-2 Bot commented Jun 8, 2026

Pipelines

⚠️ Warnings

🚦 1 Pipeline job failed

DataDog/java-profiler | report-dd-trace-results

Commit SHA: 1cbef77

dd-octo-sts Bot commented Jun 8, 2026

CI Test Results

Run: #27147595979 | Commit: ce98116 | Duration: 14m 53s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-06-08 15:34:37 UTC

Copilot AI left a comment


Pull request overview

Fixes a set of concurrency and memory-safety issues in CallTraceHashTable that can lead to sporadic SIGSEGVs during JFR dumps under high-rate wall-clock profiling, and adds regression coverage (unit, stress, fuzz, and Java integration) to prevent recurrence.

Changes:

  • Fixes clearTableOnly() refcount draining and _prev-chain disconnection to avoid races/UAF during table rotation.
  • Makes _table publication/consumption use explicit acquire/release atomics for correctness on weakly ordered architectures.
  • Adds new regression tests (C++ unit/stress, fuzz target, Java race test) plus build-logic tweaks for fuzz and sanitizer behavior.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 10 comments.

| File | Description |
|------|-------------|
| ddprof-lib/src/main/cpp/callTraceHashTable.cpp | Implements targeted refcount drain, correct _prev chain clearing, and atomic _table loads/stores in clear/collect paths. |
| ddprof-lib/src/main/cpp/callTraceHashTable.h | Documents the _table synchronization protocol. |
| ddprof-lib/src/main/cpp/linearAllocator.cpp | Adds sanitizer (ASan/TSan) state handling around chunk lifecycle to reduce false positives / stale-state issues. |
| ddprof-lib/src/test/cpp/test_callTraceStorage.cpp | Adds focused C++ unit regressions for expanded-table chain handling and collection completeness. |
| ddprof-lib/src/test/cpp/stress_callTraceStorage.cpp | Adds a stress regression that forces expansion and exercises concurrent put + processTraces. |
| ddprof-lib/src/test/fuzz/fuzz_callTraceStorage.cpp | Introduces a libFuzzer target to exercise put/processTraces/clear sequences across expansion. |
| ddprof-lib/src/test/fuzz/corpus/fuzz_callTraceStorage/seed0 | Adds a seed input for the new fuzz target. |
| ddprof-lib/src/test/fuzz/README.md | Documents the new fuzz target and what it is intended to detect. |
| ddprof-test/src/test/java/com/datadoghq/profiler/jfr/ProcessCallTracesRaceTest.java | Adds a Java integration regression reproducing dump-under-load timing/race conditions. |
| build-logic/conventions/src/main/kotlin/com/datadoghq/native/util/PlatformUtils.kt | Improves fuzzer toolchain detection by link-checking -fsanitize=fuzzer. |
| build-logic/conventions/src/main/kotlin/com/datadoghq/native/gtest/GtestTaskBuilder.kt | Adjusts sanitizer option handling for native gtest execution. |
| build-logic/conventions/src/main/kotlin/com/datadoghq/native/fuzz/FuzzTargetsPlugin.kt | Improves messaging when fuzz targets are skipped due to missing libFuzzer. |

Comment thread ddprof-lib/src/main/cpp/callTraceHashTable.h Outdated
Comment thread ddprof-lib/src/main/cpp/linearAllocator.cpp
Comment thread ddprof-lib/src/test/fuzz/README.md
Comment thread ddprof-lib/src/test/cpp/stress_callTraceStorage.cpp Outdated
Comment thread ddprof-lib/src/test/cpp/stress_callTraceStorage.cpp Outdated
Comment thread ddprof-lib/src/test/cpp/stress_callTraceStorage.cpp Outdated
@jbachorik jbachorik marked this pull request as ready for review June 8, 2026 15:15
@jbachorik jbachorik requested a review from a team as a code owner June 8, 2026 15:15