8 changes: 3 additions & 5 deletions include/lib/allocation_tracker.hpp
@@ -12,7 +12,6 @@
#include "pevent.hpp"
#include "unlikely.hpp"

#include <atomic>
#include <cstddef>
#include <functional>
#include <mutex>
@@ -86,12 +85,11 @@ class AllocationTracker {
static void ensure_key_initialized();

private:
static void create_key();
static void delete_tl_state(void *tl_state);

// POSIX does not define an invalid pthread_key_t value, but implementations
// allocate keys starting from 0, so -1 (all bits set) is a safe sentinel.
static constexpr pthread_key_t kInvalidKey = static_cast<pthread_key_t>(-1);
static std::atomic<pthread_key_t> _tl_state_key;
static pthread_once_t _tl_key_once;
static pthread_key_t _tl_state_key;
static constexpr unsigned k_ratio_max_elt_to_bitset_size = 16;

// NOLINTBEGIN(misc-non-private-member-variables-in-classes)
65 changes: 31 additions & 34 deletions src/lib/allocation_tracker.cc
@@ -18,20 +18,21 @@
#include "tsc_clock.hpp"

#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <new>
#include <pthread.h>
#include <unistd.h>

namespace ddprof {

AllocationTracker *AllocationTracker::_instance;
std::atomic<pthread_key_t> AllocationTracker::_tl_state_key{
AllocationTracker::kInvalidKey};
pthread_once_t AllocationTracker::_tl_key_once = PTHREAD_ONCE_INIT;
Collaborator:

A user reported a crash that could involve pthread_once.
I was not able to reproduce it locally.
I also do not fully trust the differences between musl and glibc for pthread_once.

Contributor Author:

As far as I can see (with Claude assist), neither glibc nor musl's pthread_once can crash from correct usage.

Both implementations:

  • Use a CAS on an int (the pthread_once_t control variable)
  • The only memory accesses are to the control variable itself (atomically) and a futex wake/wait
  • Both always return 0 (musl never returns an error, glibc always returns 0)
  • Both handle thread cancellation during init_routine via cleanup handlers

Could it be a sequence of double initialization, maybe?

Contributor Author:

Note: the only scenario where musl could deadlock is a fork during init: if fork() happens while pthread_once is executing create_key(), the child process deadlocks on musl. We could mitigate this by calling ensure_key_initialized() from the library constructor if it is a concern.

Contributor Author:

Hmm, scenario added:

Fork safety

ddprof is statically linked with musl and loaded into glibc processes. musl's pthread_once has no fork-generation mechanism: if fork() happens while pthread_once is in-progress, the child deadlocks because musl's futex wait will never be woken (the initializing thread doesn't exist in the child, and musl's fork handlers aren't registered when musl isn't the host libc).

To eliminate this race, ensure_key_initialized() is called from the library constructor (ProfilerAutoStart), before any thread can call fork(). This guarantees pthread_once has completed by the time threads exist.

Collaborator:

Thanks for thinking carefully about this. At the moment, I still think keeping the atomic is simpler.

Just to correct slightly what is above: we try not to rely on the libc implementation for the library part (what is injected into the process). We build against musl, but we deploy and inject into binaries that depend on glibc, so the library does not pull in libc as an explicit dependency.

We looked at the ABI for pthread_once and in theory it should be OK, but I still think relying on the atomic is better.

Collaborator:

> ddprof is statically linked with musl and loaded into glibc processes. musl's pthread_once has no fork-generation mechanism: if fork() happens while pthread_once is in-progress, the child deadlocks because musl's futex wait will never be woken (the initializing thread doesn't exist in the child, and musl's fork handlers aren't registered when musl isn't the host libc).

libddprof is not linked statically with musl (I don't think that's possible for a shared library to statically link libc): libddprof does not depend directly on a libc but makes the assumption that one will be loaded before libddprof is itself loaded. Hence when libddprof is loaded on a glibc host it will use glibc functions, and no musl code will ever be executed.

Collaborator:

Why not call pthread_key_create in allocation_tracking_init?
At this point symbol interposition is not yet active, so there should be no concurrent calls to init_tl_state / get_tl_state, and no need for pthread_once or an atomic.

Collaborator:

Note that the tl_state of existing threads (other than the one executing allocation_tracking_init) will never get initialized, since init_tl_state is called upon thread creation (we had many issues with lazily creating the thread-local state).

pthread_key_t AllocationTracker::_tl_state_key;

namespace {

@@ -50,51 +51,49 @@ DDPROF_NOINLINE auto sleep_and_retry_reserve(MPSCRingBufferWriter &writer,
}
return Buffer{};
}
// Abort with a diagnostic if a pthread call returns an error.
// pthread functions return the error code directly (they do not set errno),
// so perror() would print the wrong message.
#define PTHREAD_CHECK(call) \
do { \
int err_ = (call); \
if (err_ != 0) { \
fprintf(stderr, "ddprof: %s failed: %s\n", #call, strerror(err_)); \
std::abort(); \
} \
} while (0)

} // namespace

void AllocationTracker::delete_tl_state(void *tl_state) {
delete static_cast<TrackerThreadLocalState *>(tl_state);
}

void AllocationTracker::create_key() {
PTHREAD_CHECK(pthread_key_create(&_tl_state_key, delete_tl_state));
}

void AllocationTracker::ensure_key_initialized() {
if (_tl_state_key.load(std::memory_order_acquire) != kInvalidKey) {
return;
}
pthread_key_t new_key;
if (pthread_key_create(&new_key, delete_tl_state) != 0) {
return;
}
pthread_key_t expected = kInvalidKey;
if (!_tl_state_key.compare_exchange_strong(expected, new_key,
std::memory_order_release)) {
// Another thread beat us, discard our key
pthread_key_delete(new_key);
}
PTHREAD_CHECK(pthread_once(&_tl_key_once, create_key));
}

TrackerThreadLocalState *AllocationTracker::get_tl_state() {
const pthread_key_t key = _tl_state_key.load(std::memory_order_relaxed);
if (key == kInvalidKey) {
return nullptr;
}
return static_cast<TrackerThreadLocalState *>(pthread_getspecific(key));
return static_cast<TrackerThreadLocalState *>(
pthread_getspecific(_tl_state_key));
}

TrackerThreadLocalState *AllocationTracker::init_tl_state() {
// acquire pairs with the release in ensure_key_initialized(): guarantees
// that if we see a valid key the pthread key is fully published and we won't
// silently return nullptr and drop the thread's initial allocations.
const pthread_key_t key = _tl_state_key.load(std::memory_order_acquire);
if (key == kInvalidKey) {
return nullptr;
}
// Free any previously set state (e.g. inherited across fork).
Collaborator:

I think there is a fork leak, but we cannot fully address it here: TLS destructors are not called on fork when you have several threads.

I could still add this to my PR, it does not do any harm (though I doubt we should be hitting this path):

  • for the main thread that forked, we should keep existing state & not call the init again.
  • for other threads, we would lose the state (and leak it)

Collaborator:

I think this could be a real issue with the pthread approach, thanks for raising

delete static_cast<TrackerThreadLocalState *>(
pthread_getspecific(_tl_state_key));

auto *state = new (std::nothrow) TrackerThreadLocalState{};
if (!state) {
return nullptr;
}
state->tid = ddprof::gettid();
state->stack_bounds = retrieve_stack_bounds();
pthread_setspecific(key, state);
PTHREAD_CHECK(pthread_setspecific(_tl_state_key, state));
return state;
}

@@ -200,11 +199,9 @@ void AllocationTracker::free() {
pevent_munmap_event(&_pevent);

// The pthread key (_tl_state_key) is intentionally not deleted here.
// pthread_key_delete would race with concurrent get_tl_state() calls that
// already loaded the key value but haven't called pthread_getspecific yet.
// The cost is one leaked key per dlclose/reload cycle, which is acceptable:
// POSIX guarantees at least PTHREAD_KEYS_MAX (128) keys per process, and
// library reload is not a supported use case.
// pthread_key_delete would race with concurrent get_tl_state() calls.
// The cost is one leaked key per dlclose/reload cycle, acceptable given
// POSIX guarantees at least PTHREAD_KEYS_MAX (128) keys per process.

// Do not destroy the object:
// there is an inherent race condition between checking
4 changes: 4 additions & 0 deletions src/lib/dd_profiling.cc
@@ -236,6 +236,10 @@ struct ProfilerAutoStart {

ProfilerAutoStart() noexcept {
init_state();
// Create the pthread key early, before any thread can fork.
// This avoids a deadlock on musl if fork() races with pthread_once
// (musl has no fork-generation mechanism to recover in-progress state).
AllocationTracker::ensure_key_initialized();

// Note that library needs to be linked with `--no-as-needed` when using
// autostart, otherwise linker will completely remove library from DT_NEEDED
7 changes: 1 addition & 6 deletions test/allocation_tracker-bench.cc
@@ -335,7 +335,6 @@ static void BM_GetTlState(benchmark::State &state) {
std::atomic<int> ready_count{0};
std::atomic<bool> go{false};
std::atomic<bool> done{false};
std::atomic<int64_t> total_ops{0};

std::vector<std::thread> workers;
workers.reserve(nb_threads);
@@ -351,13 +350,10 @@
break;
}
// Hot loop: pure TLS access
int64_t ops = 0;
for (int j = 0; j < k_ops_per_iter; ++j) {
auto *tl = ddprof::AllocationTracker::get_tl_state();
benchmark::DoNotOptimize(tl);
++ops;
}
total_ops.fetch_add(ops, std::memory_order_relaxed);
ready_count.fetch_sub(1, std::memory_order_release);
// wait for go to be lowered before looping back
while (go.load(std::memory_order_acquire) &&
@@ -367,15 +363,14 @@
}

for (auto _ : state) {
total_ops.store(0, std::memory_order_relaxed);
// Wait for all workers to be ready
while (ready_count.load(std::memory_order_acquire) != nb_threads) {}
go.store(true, std::memory_order_release);
// Wait for all workers to finish the hot loop
while (ready_count.load(std::memory_order_acquire) != 0) {}
go.store(false, std::memory_order_release);
state.SetItemsProcessed(total_ops.load(std::memory_order_relaxed));
}
state.SetItemsProcessed(state.iterations() * nb_threads * k_ops_per_iter);

done.store(true, std::memory_order_release);
go.store(true, std::memory_order_release); // unblock workers
4 changes: 3 additions & 1 deletion test/allocation_tracker_fork_test.cc
@@ -81,7 +81,9 @@ int main() {
const LogHandle log_handle(LL_NOTICE);
LG_NTC("allocation_tracker_fork_test starting");

// Before key initialization, get_tl_state() must return NULL.
// Before ensure_key_initialized(), get_tl_state() returns NULL because
// pthread_getspecific on the zero-initialized (uncreated) key returns NULL
// on both glibc and musl.
auto *pre_init = AllocationTracker::get_tl_state();
CHECK_OR_RETURN(pre_init == nullptr, "expected NULL before key init (got %p)",
static_cast<void *>(pre_init));