fix(calltrace): fix three SIGSEGV-inducing defects in clearTableOnly/collect (PROF-14889) #585
jbachorik wants to merge 17 commits
- Fix inaccurate comments: the `clearTableOnly` active-table path comment (no prior caller drain; protection is `lockAll` + in-function wait), the `decrementCounters` precondition (targeted wait, not global), and the RELEASE store comment, expanded for aarch64 clarity
- Fix header comment: `_table` is accessed with ACQUIRE/RELEASE/ACQ_REL ordering; document exceptions for plain-load callers
- Drop redundant `const_cast` in the `collect()` loop initialiser
- Use atomic ACQUIRE loads in `putWithExistingId()` for consistency
- Strengthen `ClearTableOnlyDisconnectsFullChain`: the second `processTraces()` now asserts count==1 (sentinel only) for deterministic defect detection
- `ProcessCallTracesRaceTest`: pre-allocate one temp file per dump thread to avoid filesystem churn; widen `awaitTermination` to +30 s
CI Test Results (run #27147595979): Total: 32, Passed: 32, Failed: 0. Updated: 2026-06-08 15:34:37 UTC
…warning Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Fixes a set of concurrency and memory-safety issues in CallTraceHashTable that can lead to sporadic SIGSEGVs during JFR dumps under high-rate wall-clock profiling, and adds regression coverage (unit, stress, fuzz, and Java integration) to prevent recurrence.
Changes:
- Fixes `clearTableOnly()` refcount draining and `_prev`-chain disconnection to avoid races/UAF during table rotation.
- Makes `_table` publication/consumption use explicit acquire/release atomics for correctness on weakly ordered architectures.
- Adds new regression tests (C++ unit/stress, fuzz target, Java race test) plus build-logic tweaks for fuzz and sanitizer behavior.
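The acquire/release publication pattern mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `Table` and `Holder` are hypothetical stand-ins, and only the GCC/Clang `__atomic_*` builtins are taken from the PR description.

```cpp
#include <cstddef>

// Hypothetical stand-in for a hash-table node; not the project's real type.
struct Table {
    Table* prev;
    std::size_t capacity;
};

// Hypothetical holder of the current-table pointer.
struct Holder {
    Table* table = nullptr;

    // Reader side: the ACQUIRE load pairs with the writer's RELEASE, so fields
    // written to the Table before publication are visible after this load.
    Table* load() {
        return __atomic_load_n(&table, __ATOMIC_ACQUIRE);
    }

    // Writer side (reinitialisation): publish a fully initialised table.
    void publish(Table* t) {
        __atomic_store_n(&table, t, __ATOMIC_RELEASE);
    }

    // Expansion: install `next` only if `expected` is still the current table.
    // Note that next->prev is written even if the CAS then fails.
    bool expand(Table* expected, Table* next) {
        next->prev = expected;
        return __atomic_compare_exchange_n(&table, &expected, next,
                                           /*weak=*/false,
                                           __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
    }
};
```

A plain (non-atomic) read of `table` in place of `load()` would compile, but on a weakly ordered CPU it gives no guarantee that the new table's fields are visible to the reader, which is the hazard the change is described as removing.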
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| ddprof-lib/src/main/cpp/callTraceHashTable.cpp | Implements targeted refcount drain, correct _prev chain clearing, and atomic _table loads/stores in clear/collect paths. |
| ddprof-lib/src/main/cpp/callTraceHashTable.h | Documents the _table synchronization protocol. |
| ddprof-lib/src/main/cpp/linearAllocator.cpp | Adds sanitizer (ASan/TSan) state handling around chunk lifecycle to reduce false positives / stale-state issues. |
| ddprof-lib/src/test/cpp/test_callTraceStorage.cpp | Adds focused C++ unit regressions for expanded-table chain handling and collection completeness. |
| ddprof-lib/src/test/cpp/stress_callTraceStorage.cpp | Adds a stress regression that forces expansion and exercises concurrent put + processTraces. |
| ddprof-lib/src/test/fuzz/fuzz_callTraceStorage.cpp | Introduces a libFuzzer target to exercise put/processTraces/clear sequences across expansion. |
| ddprof-lib/src/test/fuzz/corpus/fuzz_callTraceStorage/seed0 | Adds a seed input for the new fuzz target. |
| ddprof-lib/src/test/fuzz/README.md | Documents the new fuzz target and what it is intended to detect. |
| ddprof-test/src/test/java/com/datadoghq/profiler/jfr/ProcessCallTracesRaceTest.java | Adds a Java integration regression reproducing dump-under-load timing/race conditions. |
| build-logic/conventions/src/main/kotlin/com/datadoghq/native/util/PlatformUtils.kt | Improves fuzzer toolchain detection by link-checking -fsanitize=fuzzer. |
| build-logic/conventions/src/main/kotlin/com/datadoghq/native/gtest/GtestTaskBuilder.kt | Adjusts sanitizer option handling for native gtest execution. |
| build-logic/conventions/src/main/kotlin/com/datadoghq/native/fuzz/FuzzTargetsPlugin.kt | Improves messaging when fuzz targets are skipped due to missing libFuzzer. |
What does this PR do?:
Fixes three independent defects in `CallTraceHashTable` that together produce a sporadic null-deref SIGSEGV in `Profiler::processCallTraces` during JFR snapshot dumps under high-rate wall-clock profiling.

- Defect A: `clearTableOnly()` called `waitForAllRefCountsToClear()` (global wait) instead of `waitForRefCountToClear(this)` (targeted). Under a high signal rate the global wait times out while `put()` calls to `_active` are still in flight. The function then proceeds, leaving `collect()` on the old active table racing with a still-running `put()`. Location: `callTraceHashTable.cpp:144`. Impact: `processCallTraces`.
- Defect B: `collect()` and `clearTableOnly()` read `_table` with a plain (non-atomic) load while `put()` can CAS-expand `_table` on another thread (ACQ_REL CAS). On ARM64 (weak memory ordering) this is a data race: the reader may see a stale or partially-initialised pointer. Location: `callTraceHashTable.cpp:410, 175`.
- Defect C: the `_prev`-chain-clearing loop in `clearTableOnly()` advanced via `table->prev()` after `setPrev(nullptr)` had already cleared the link, so the loop exited after the first node on an expanded (multi-node) table. Location: `callTraceHashTable.cpp:148–153`. Impact: `_prev` pointers left in freed memory; would corrupt `collect()` on memory reuse.

Motivation:
PROF-14889: customer JVMs running Java 25 under heavy wall-clock profiling crash with SIGSEGV in `processCallTraces`. The crash is probabilistic and hard to reproduce because it requires specific timing: a `dump()` call must arrive while a high-rate signal storm is filling `CallTraceStorage`, triggering the global-wait timeout in defect A and exposing the use-after-free window.

Defect B is a latent correctness issue on aarch64 that TSan would flag; defect C is a latent memory-safety issue that would manifest under allocator reuse.
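To make defect C concrete, here is a minimal sketch of the two loop shapes, assuming a hypothetical `Node` type with `prev()`/`setPrev()` accessors standing in for the real table node. The function names `disconnectBuggy` and `disconnectFixed` are illustrative and do not appear in the codebase.

```cpp
#include <cstddef>

// Hypothetical node type standing in for the real table node.
struct Node {
    Node* prev_ = nullptr;
    Node* prev() const { return prev_; }
    void setPrev(Node* p) { prev_ = p; }
};

// Buggy shape: the link is cleared before it is read for the loop advance,
// so the loop stops after the first node and older nodes stay linked.
inline std::size_t disconnectBuggy(Node* head) {
    std::size_t visited = 0;
    for (Node* t = head; t != nullptr; t = t->prev()) {
        t->setPrev(nullptr);
        ++visited;
    }
    return visited;
}

// Fixed shape: save the link to a local, clear it, then advance via the local.
inline std::size_t disconnectFixed(Node* head) {
    std::size_t visited = 0;
    Node* t = head;
    while (t != nullptr) {
        Node* next = t->prev();
        t->setPrev(nullptr);
        t = next;
        ++visited;
    }
    return visited;
}
```

On a three-node chain the buggy loop visits only the head, leaving the remaining nodes still pointing at each other, while the fixed loop visits and unlinks every node.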
Additional Notes:
- `RefCountGuard::waitForRefCountToClear(this)` returns immediately for standby/scratch tables (they are never `_active_storage`) and correctly drains only in-flight puts to the table being cleared.
- `_table` reads that can race with `put()` expansion now use `__atomic_load_n(..., __ATOMIC_ACQUIRE)`; the reinitialisation write in `clearTableOnly()` uses `__atomic_store_n(..., __ATOMIC_RELEASE)`. The `decrementCounters()` and `putWithExistingId()` plain reads remain safe because their callers guarantee no concurrent writer (`lockAll()`-guarded clear path or single-threaded scratch/standby contract).
- The chain-clearing loop now saves `table->prev()` to a local before calling `setPrev(nullptr)`, then advances via the saved pointer.
- The `_table` field comment in `callTraceHashTable.h` is updated to document the acquire/release protocol.
- The `decrementCounters()` comment that referenced the removed `waitForAllRefCountsToClear()` is updated to reflect the actual safety invariant.

How to test the change?:
Two new C++ unit tests in `test_callTraceStorage.cpp`:
- `ClearTableOnlyDisconnectsFullChain`: inserts 50 000 entries to force table expansion, then calls `processTraces()` twice and asserts no crash / no dangling pointer corruption (targets defect C).
- `CollectFindsAllTracesAcrossExpandedChain`: inserts 50 000 entries, calls `processTraces()`, and asserts every inserted trace ID appears in the snapshot (targets defect B).

One new Java integration test, `ProcessCallTracesRaceTest`:
- Runs `dump()` threads for 20 seconds under `wall=1ms,cpu=1ms`.

CI runs the C++ unit tests (`ddprof-lib:testRelease`) and the Java integration tests (`ddprof-test:testRelease`) on all supported platforms including aarch64.

For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance.