8 changes: 3 additions & 5 deletions include/lib/allocation_tracker.hpp
@@ -12,7 +12,6 @@
#include "pevent.hpp"
#include "unlikely.hpp"

#include <atomic>
#include <cstddef>
#include <functional>
#include <mutex>
@@ -86,12 +85,11 @@ class AllocationTracker {
static void ensure_key_initialized();

private:
static void create_key();
static void delete_tl_state(void *tl_state);

// POSIX does not define an invalid pthread_key_t value, but implementations
// allocate keys starting from 0, so -1 (all bits set) is a safe sentinel.
static constexpr pthread_key_t kInvalidKey = static_cast<pthread_key_t>(-1);
static std::atomic<pthread_key_t> _tl_state_key;
static pthread_once_t _tl_key_once;
static pthread_key_t _tl_state_key;
static constexpr unsigned k_ratio_max_elt_to_bitset_size = 16;

// NOLINTBEGIN(misc-non-private-member-variables-in-classes)
65 changes: 31 additions & 34 deletions src/lib/allocation_tracker.cc
@@ -18,20 +18,21 @@
#include "tsc_clock.hpp"

#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <new>
#include <pthread.h>
#include <unistd.h>

namespace ddprof {

AllocationTracker *AllocationTracker::_instance;
std::atomic<pthread_key_t> AllocationTracker::_tl_state_key{
AllocationTracker::kInvalidKey};
pthread_once_t AllocationTracker::_tl_key_once = PTHREAD_ONCE_INIT;
Collaborator:

A user reported a crash that could involve pthread_once.
I was not able to reproduce it locally.
I also do not fully trust the differences between musl and glibc for pthread_once.

Contributor Author:

As far as I can see (with Claude assist), neither glibc nor musl's pthread_once can crash from correct usage.

Both implementations:

  • Use a CAS on an int (the pthread_once_t control variable)
  • The only memory accesses are to the control variable itself (atomically) and a futex wake/wait
  • Both always return 0 (musl never returns an error, glibc always returns 0)
  • Both handle thread cancellation during init_routine via cleanup handlers

Could it be a sequence of double initialization, maybe?

Contributor Author:

Note: the only scenario where musl could deadlock is a fork during init: if fork() happens while pthread_once is executing create_key(), the child process deadlocks on musl. We could mitigate this by calling ensure_key_initialized() from the library constructor if it is a concern.

Contributor Author:

Hmm, scenario added:

Fork safety

ddprof is statically linked with musl and loaded into glibc processes. musl's pthread_once has no fork-generation mechanism: if fork() happens while pthread_once is in-progress, the child deadlocks because musl's futex wait will never be woken (the initializing thread doesn't exist in the child, and musl's fork handlers aren't registered when musl isn't the host libc).

To eliminate this race, ensure_key_initialized() is called from the library constructor (ProfilerAutoStart), before any thread can call fork(). This guarantees pthread_once has completed by the time threads exist.

Collaborator:

Thanks for thinking carefully about this. At the moment, I still think keeping the atomic is simpler.

Just to correct slightly what is above: we try not to rely on the libc implementation for the library part (what is injected into the process). We build against musl, but we deploy and inject into binaries that depend on glibc, so the library does not pull in libc as an explicit dependency.

We looked at the ABI for pthread_once and in theory it should be OK, but I still think relying on the atomic is better.

Collaborator:

> ddprof is statically linked with musl and loaded into glibc processes. musl's pthread_once has no fork-generation mechanism: if fork() happens while pthread_once is in-progress, the child deadlocks because musl's futex wait will never be woken (the initializing thread doesn't exist in the child, and musl's fork handlers aren't registered when musl isn't the host libc).

libddprof is not linked statically with musl (I don't think that's possible for a shared library to statically link libc): libddprof does not depend directly on a libc but makes the assumption that one will be loaded before libddprof is itself loaded. Hence when libddprof is loaded on a glibc host it will use glibc functions, and no musl code will ever be executed.

Collaborator:

Why not call pthread_key_create in allocation_tracking_init?
At this point symbol interposition is not yet active, so there should be no concurrent calls to init_tl_state / get_tl_state, and no need for pthread_once or an atomic.

Collaborator:

Note that the tl_state of existing threads (other than the one executing allocation_tracking_init) will never get initialized, since init_tl_state is called upon thread creation (we had many issues with lazily creating the thread-local state).

pthread_key_t AllocationTracker::_tl_state_key;

namespace {

@@ -50,51 +51,49 @@ DDPROF_NOINLINE auto sleep_and_retry_reserve(MPSCRingBufferWriter &writer,
}
return Buffer{};
}
// Abort with a diagnostic if a pthread call returns an error.
// pthread functions return the error code directly (they do not set errno),
// so perror() would print the wrong message.
#define PTHREAD_CHECK(call) \
do { \
int err_ = (call); \
if (err_ != 0) { \
fprintf(stderr, "ddprof: %s failed: %s\n", #call, strerror(err_)); \
std::abort(); \
} \
} while (0)

} // namespace

void AllocationTracker::delete_tl_state(void *tl_state) {
delete static_cast<TrackerThreadLocalState *>(tl_state);
}

void AllocationTracker::create_key() {
PTHREAD_CHECK(pthread_key_create(&_tl_state_key, delete_tl_state));
}

void AllocationTracker::ensure_key_initialized() {
if (_tl_state_key.load(std::memory_order_acquire) != kInvalidKey) {
return;
}
pthread_key_t new_key;
if (pthread_key_create(&new_key, delete_tl_state) != 0) {
return;
}
pthread_key_t expected = kInvalidKey;
if (!_tl_state_key.compare_exchange_strong(expected, new_key,
std::memory_order_release)) {
// Another thread beat us, discard our key
pthread_key_delete(new_key);
}
PTHREAD_CHECK(pthread_once(&_tl_key_once, create_key));
}

TrackerThreadLocalState *AllocationTracker::get_tl_state() {
const pthread_key_t key = _tl_state_key.load(std::memory_order_relaxed);
if (key == kInvalidKey) {
return nullptr;
}
return static_cast<TrackerThreadLocalState *>(pthread_getspecific(key));
return static_cast<TrackerThreadLocalState *>(
pthread_getspecific(_tl_state_key));
}

TrackerThreadLocalState *AllocationTracker::init_tl_state() {
// acquire pairs with the release in ensure_key_initialized(): guarantees
// that if we see a valid key the pthread key is fully published and we won't
// silently return nullptr and drop the thread's initial allocations.
const pthread_key_t key = _tl_state_key.load(std::memory_order_acquire);
if (key == kInvalidKey) {
return nullptr;
}
// Free any previously set state (e.g. inherited across fork).
Collaborator:

I think there is a fork leak, but we cannot fully address it here: TLS destructors are not called on fork when you have several threads.

I could still add this to my PR, it does not do any harm (though I doubt we should be hitting this path):

  • for the main thread that forked, we should keep existing state & not call the init again.
  • for other threads, we would lose the state (and leak it)

Collaborator:

I think this could be a real issue with the pthread approach, thanks for raising

delete static_cast<TrackerThreadLocalState *>(
pthread_getspecific(_tl_state_key));

auto *state = new (std::nothrow) TrackerThreadLocalState{};
if (!state) {
return nullptr;
}
state->tid = ddprof::gettid();
state->stack_bounds = retrieve_stack_bounds();
pthread_setspecific(key, state);
PTHREAD_CHECK(pthread_setspecific(_tl_state_key, state));
return state;
}

@@ -200,11 +199,9 @@ void AllocationTracker::free() {
pevent_munmap_event(&_pevent);

// The pthread key (_tl_state_key) is intentionally not deleted here.
// pthread_key_delete would race with concurrent get_tl_state() calls that
// already loaded the key value but haven't called pthread_getspecific yet.
// The cost is one leaked key per dlclose/reload cycle, which is acceptable:
// POSIX guarantees at least PTHREAD_KEYS_MAX (128) keys per process, and
// library reload is not a supported use case.
// pthread_key_delete would race with concurrent get_tl_state() calls.
// The cost is one leaked key per dlclose/reload cycle, acceptable given
// POSIX guarantees at least PTHREAD_KEYS_MAX (128) keys per process.

// Do not destroy the object:
// there is an inherent race condition between checking
4 changes: 4 additions & 0 deletions src/lib/dd_profiling.cc
@@ -236,6 +236,10 @@ struct ProfilerAutoStart {

ProfilerAutoStart() noexcept {
init_state();
// Create the pthread key early, before any thread can fork.
// This avoids a deadlock on musl if fork() races with pthread_once
// (musl has no fork-generation mechanism to recover in-progress state).
AllocationTracker::ensure_key_initialized();

// Note that library needs to be linked with `--no-as-needed` when using
// autostart, otherwise linker will completely remove library from DT_NEEDED
7 changes: 1 addition & 6 deletions test/allocation_tracker-bench.cc
@@ -335,7 +335,6 @@ static void BM_GetTlState(benchmark::State &state) {
std::atomic<int> ready_count{0};
std::atomic<bool> go{false};
std::atomic<bool> done{false};
std::atomic<int64_t> total_ops{0};

std::vector<std::thread> workers;
workers.reserve(nb_threads);
@@ -351,13 +350,10 @@
break;
}
// Hot loop: pure TLS access
int64_t ops = 0;
for (int j = 0; j < k_ops_per_iter; ++j) {
auto *tl = ddprof::AllocationTracker::get_tl_state();
benchmark::DoNotOptimize(tl);
++ops;
}
total_ops.fetch_add(ops, std::memory_order_relaxed);
ready_count.fetch_sub(1, std::memory_order_release);
// wait for go to be lowered before looping back
while (go.load(std::memory_order_acquire) &&
@@ -367,15 +363,14 @@
}

for (auto _ : state) {
total_ops.store(0, std::memory_order_relaxed);
// Wait for all workers to be ready
while (ready_count.load(std::memory_order_acquire) != nb_threads) {}
go.store(true, std::memory_order_release);
// Wait for all workers to finish the hot loop
while (ready_count.load(std::memory_order_acquire) != 0) {}
go.store(false, std::memory_order_release);
state.SetItemsProcessed(total_ops.load(std::memory_order_relaxed));
}
state.SetItemsProcessed(state.iterations() * nb_threads * k_ops_per_iter);

done.store(true, std::memory_order_release);
go.store(true, std::memory_order_release); // unblock workers
4 changes: 3 additions & 1 deletion test/allocation_tracker_fork_test.cc
@@ -81,7 +81,9 @@ int main() {
const LogHandle log_handle(LL_NOTICE);
LG_NTC("allocation_tracker_fork_test starting");

// Before key initialization, get_tl_state() must return NULL.
// Before ensure_key_initialized(), get_tl_state() returns NULL because
// pthread_getspecific on the zero-initialized (uncreated) key returns NULL
// on both glibc and musl.
auto *pre_init = AllocationTracker::get_tl_state();
CHECK_OR_RETURN(pre_init == nullptr, "expected NULL before key init (got %p)",
static_cast<void *>(pre_init));