Skip to content

fix(context): scope OTEL ThreadContext storage to the carrier thread (PROF-15271)#625

Merged
rkennke merged 8 commits into
mainfrom
rkennke/PROF-15271
Jul 2, 2026
Merged

fix(context): scope OTEL ThreadContext storage to the carrier thread (PROF-15271)#625
rkennke merged 8 commits into
mainfrom
rkennke/PROF-15271

Conversation

@rkennke

@rkennke rkennke commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?:

Scopes OTEL ThreadContext storage to the carrier thread when running under virtual threads, replacing the plain ThreadLocal that keyed it by the virtual thread. Introduced as a selectable mode (OtelContextStorage.Mode.CARRIER) backed by jdk.internal.misc.CarrierThreadLocal, with a kill-switch and a safe fallback to today's behavior.

Motivation:

The ThreadContext's DirectByteBuffer is a window into the OTEP record embedded in the carrier's native ProfiledThread — the record the (carrier-bound) sampler reads. Caching it in a plain ThreadLocal keys it by the virtual thread and pins it to whichever carrier was mounted at first use. This is:

  • Wrong once the vthread migrates — writes land on the old carrier, so a sampler interrupting the new carrier sees stale/empty context (same family as the attribute-loss issue tracked in dd-trace-java #11646); and
  • Unsafe once the old carrier's OS thread exits — freeKey/release() frees the ProfiledThread while the cached buffer keeps being written. That's a use-after-free into memory glibc may have handed back to the JVM, observed in the field as a SIGSEGV in ThreadsSMRSupport::free_list under a Loom workload on Corretto 26.

CarrierThreadLocal.get()/set()/remove() operate on the current carrier's map even from a mounted virtual thread, so a vthread always resolves to its current carrier's live record. Storage lifetime then matches the native record's lifetime, structurally eliminating the dangling-buffer window, and writes land where the sampler reads.

Additional Notes:

  • Java 8 baseline / unnamed module. The lib compiles at --release 8, so the internal type can't be named at compile time. It's built reflectively once and held as its ThreadLocal supertype, so calls dispatch virtually with no per-call reflection and no compile-time dependency.
  • Mode selection via -Dddprof.context.storage.mode=auto|carrier|thread (default auto):
    • auto — carrier scoping when available, else fall back to plain ThreadLocal. Fallback is silent on JDK < 21 (expected) and a loud WARN on JDK 21+ (where it's the unsafe pre-fix behavior).
    • carrier — carrier scoping is required: init fails hard (throws) if CarrierThreadLocal is inaccessible, rather than silently re-exposing the UAF. This is the fail-fast callers opt into once the runtime export is guaranteed.
    • thread — force legacy thread-scoped storage (kill-switch).
  • Why auto degrades instead of failing hard by default: at init we only know the JVM is Loom-capable, not whether virtual threads actually route context here, and the runtime export carrier scoping needs is granted by the agent — which may load after this library. Failing hard by default would break profiler startup for every JDK 21+ deployment (including non-Loom apps) until that grant lands. So auto stays safe-and-loud; the tracer opts into carrier once it guarantees the export.
  • Not in this PR (follow-ups):
    1. dd-trace-java grants --add-exports java.base/jdk.internal.misc to the profiler loader (via Instrumentation.redefineModule) and sets mode=carrier so CARRIER engages and fails fast in production.
    2. Reapply-on-mount / clear-on-unmount coherence (#11646) — this change makes that reapply safe (it writes to the live current carrier) but doesn't implement it.
  • Sharing one ThreadContext per carrier is safe: its only non-buffer mutable state is the attribute cache, which is value-validated (value.equals(attrCacheKeys[slot])) with process-global encodings, so cross-vthread sharing costs at most a cache miss, never wrong data.

How to test the change?:

Automated (green on JDK 21 with carrier scoping active):

  • CarrierContextStorageTest — 2000 virtual threads resolve to ~one ThreadContext per carrier (distinct-context count ≪ vthread count), proving storage is not keyed by the virtual thread. Skips when carrier mode isn't active.
  • OtelContextStorageTest — kill-switch forces plain ThreadLocal; auto/carrier select carrier when available; carrier throws when unavailable (fail-fast, runs on JDK < 21 CI).
  • Existing OtelContextStorageModeTest passes unchanged (platform-thread isolation preserved).

The build adds --add-exports java.base/jdk.internal.misc=ALL-UNNAMED only on JDK ≥ 21 (the flag aborts a Java 8 JVM). The version is checked against the actual test JVM at task-execution time, so the split-JDK musl matrix (build JDK 21, test JDK 8) resolves correctly.

For Datadog employees:

  • This PR doesn't touch any of that.
  • JIRA: PROF-15271

🤖 Generated with Claude Code

…(PROF-15271)

The ThreadContext DirectByteBuffer is a window into the OTEP record embedded in
the *carrier's* native ProfiledThread — the record the (carrier-bound) sampler
reads. It was cached in a plain ThreadLocal, keying it by the *virtual* thread
and pinning it to whichever carrier was mounted at first use. That is wrong once
the vthread migrates (writes land on the old carrier, so a sampler on the new
carrier sees stale/empty context) and unsafe once the old carrier's OS thread
exits: the record is freed while the buffer keeps being written — a use-after-free
that can corrupt JVM-owned native memory (observed as a crash in
ThreadsSMRSupport::free_list).

Introduce a context-storage mode. In CARRIER mode, storage is backed by
jdk.internal.misc.CarrierThreadLocal (JDK 21+), whose get()/set()/remove()
operate on the current carrier's map even from a mounted virtual thread, so a
vthread always resolves to its current carrier's live record. Storage lifetime
then matches the native record's lifetime, eliminating the dangling-buffer window.

- OtelContextStorage: factory + Mode enum + kill-switch. The internal type is
  built reflectively and held as its ThreadLocal supertype, so calls dispatch
  virtually with no per-call reflection and no compile-time dependency (the lib
  is a Java 8 baseline). Degrades to a plain ThreadLocal — today's behavior —
  when the type is missing (older JDK) or inaccessible (export not granted),
  never failing hard. Selected via -Dddprof.context.storage.mode=auto|carrier|thread.
- JavaProfiler: field uses the factory; a currentContext() get-or-init helper
  replaces ThreadLocal.withInitial (a reflectively-built CarrierThreadLocal cannot
  carry a supplier); the 8 context write sites route through it; contextStorageMode()
  added for diagnostics/tests.
- Build: add --add-exports java.base/jdk.internal.misc=ALL-UNNAMED to the test JVM,
  gated on testJvmMajorVersion() >= 21 (the flag aborts a Java 8 JVM). Production
  grants the export from the agent (follow-up).

Tests:
- CarrierContextStorageTest: 2000 vthreads resolve to exactly one ThreadContext
  per carrier (far fewer than the vthread count).
- OtelContextStorageTest: kill-switch forces plain ThreadLocal; auto selects
  carrier when available; explicit carrier falls back without throwing.
- context_uaf_ut.cpp: ASan proof of the unguarded write-after-free primitive.
- Existing OtelContextStorageModeTest unregressed.

Follow-ups: export grant in dd-trace-java; reapply-on-mount / clear-on-unmount
coherence (PROF / #11646), which this change makes safe.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@datadog-datadog-prod-us1

This comment has been minimized.

…available

Silently falling back to thread-scoped storage on a JDK 21+ JVM re-exposes the
exact virtual-thread context use-after-free this change removes. Make the policy
explicit at the mode level:

- carrier: carrier scoping is required — create() now throws if
  jdk.internal.misc.CarrierThreadLocal is not accessible, with an actionable
  message (add the --add-exports, or opt into mode=thread). This is the fail-fast
  callers get once the runtime export is guaranteed.
- auto (default): still degrades so profiling loads even without the export
  (the export may be granted by an agent that loads after this library, and the
  fallback is safe for non-Loom apps), but now logs a loud WARN on JDK 21+ where
  the fallback is unsafe under virtual threads. Silent on JDK < 21 (expected).
- thread: unchanged (legacy, silent).

Default stays graceful because at init we only know the JVM is Loom-capable, not
whether virtual threads route context here; failing hard by default would break
profiler startup for every JDK 21+ deployment until the agent-side export grant
lands. The tracer opts into mode=carrier once it guarantees the export.

Tests: carrierModeThrowsWhenUnavailable (fail-fast) and carrierModeUsesCarrierWhenAvailable
replace the previous graceful-fallback-on-carrier test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dd-octo-sts

dd-octo-sts Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

CI Test Results

Run: #28574331429 | Commit: ac16de0 | Duration: 13m 30s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-07-02 08:09:18 UTC

@dd-octo-sts

dd-octo-sts Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Benchmark Results

Pipeline: https://gitlab.ddbuild.io/DataDog/apm-reliability/benchmarking-platform/-/pipelines/122294518 Commit: 2e39c826e2bbc2dd9dba2e844fcfa522d8f1f8c4

⚠️ Significant outliers

  • 🔴 future-genetic (JDK 21): runtime +11% (2024→2246 ms)
Runtime details (per benchmark × JDK)
Benchmark JDK Latest Dev Δ (dev vs latest) Issues L/D
akka-uct 21 ✅ 10480 ms (7 iters) ✅ 10112 ms (7 iters) ≈ -3.5% (±19.6%) — / —
akka-uct 25 ✅ 8867 ms (8 iters) ✅ 8822 ms (8 iters) ≈ -0.5% (±16.3%) — / —
finagle-chirper 21 ✅ 6008 ms (11 iters) ✅ 6049 ms (11 iters) ≈ +0.7% (±45.4%) ⚠️ W:1 / ⚠️ W:1
finagle-chirper 25 ✅ 5525 ms (12 iters) ✅ 5397 ms (12 iters) ≈ -2.3% (±40.4%) ⚠️ W:1 / ⚠️ W:1
fj-kmeans 21 ✅ 2810 ms (22 iters) ✅ 2818 ms (22 iters) ≈ +0.3% (±4.2%) — / —
fj-kmeans 25 ✅ 2827 ms (22 iters) ✅ 2857 ms (22 iters) ≈ +1.1% (±4.5%) — / —
future-genetic 21 ✅ 2024 ms (31 iters) ✅ 2246 ms (28 iters) 🔴 +11% — / —
future-genetic 25 ✅ 2098 ms (30 iters) ✅ 2009 ms (31 iters) ≈ -4.2% (±4.4%) — / —
naive-bayes 21 ✅ 1266 ms (45 iters) ✅ 1217 ms (47 iters) ≈ -3.9% (±55.4%) — / —
naive-bayes 25 ✅ 1013 ms (56 iters) ✅ 926 ms (61 iters) ≈ -8.6% (±54.1%) — / —
reactors 21 ✅ 15177 ms (5 iters) ✅ 16506 ms (5 iters) ≈ +8.8% (±15.4%) — / —
reactors 25 ✅ 18264 ms (5 iters) ✅ 17246 ms (5 iters) ≈ -5.6% (±7.2%) — / —
Internal counter details (ddprof)

ddprof internal counters, latest / dev (✅ = 0, · = unavailable):

Benchmark JDK Dropped rec Dropped jvmti Dropped trace Skipped WC AGCT fail Unwind fail
akka-uct 21 ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅
akka-uct 25 ✅ / ✅ ✅ / ✅ 2 / 1 2182 / 2399 ✅ / ✅ ✅ / ✅
finagle-chirper 21 ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅
finagle-chirper 25 ✅ / ✅ ✅ / ✅ ✅ / 2 8692 / 8242 ✅ / ✅ ✅ / ✅
fj-kmeans 21 ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅
fj-kmeans 25 ✅ / ✅ ✅ / ✅ 2 / 2 1299 / 1269 ✅ / ✅ ✅ / ✅
future-genetic 21 ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅
future-genetic 25 ✅ / ✅ ✅ / ✅ 6 / 2 2911 / 2858 ✅ / ✅ ✅ / ✅
naive-bayes 21 ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅
naive-bayes 25 ✅ / ✅ ✅ / ✅ 4 / 4 3471 / 3462 ✅ / ✅ ✅ / ✅
reactors 21 ✅ / ✅ ✅ / ✅ 1 / ✅ 1511 / 1733 ✅ / ✅ ✅ / ✅
reactors 25 ✅ / ✅ ✅ / ✅ ✅ / 1 1875 / 1712 ✅ / ✅ ✅ / ✅

rkennke and others added 3 commits July 1, 2026 13:23
…t, harden test & parsing

- Remove ddprof-lib/src/test/cpp/context_uaf_ut.cpp: it proved the native
  write-after-free *primitive*, not this PR's Java carrier-scoping fix (its
  red->green flip referenced a hypothetical native guard). The fix mechanism is
  validated by CarrierContextStorageTest; a deterministic end-to-end UAF test
  isn't feasible, so the danger stays documented in the PR/commit rationale.
- CarrierContextStorageTest: fix the stale "two touches with a park" comment to
  match the single-touch reality, and relax the per-carrier assertion from
  ==1 to >=1. The map is keyed by carrier *name*, so a retired-and-replaced
  ForkJoinPool worker reusing a slot name could legitimately expose a second
  context; the did-NOT-key-by-vthread guarantee is now carried by the aggregate
  bound (context count well below vthread count), which is name-stability-free.
- OtelContextStorage: parse the mode property with toLowerCase(Locale.ROOT) so
  "CARRIER" is not mangled to "carrıer" under tr_TR (which would silently drop
  the fail-fast semantics).
- JavaProfiler.getThreadContext(): document that the returned ThreadContext must
  not be cached across a possible carrier migration (unmount/remount) — the
  buffer targets the carrier mounted at call time; setContext* re-fetch per use.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The musl split-JDK CI (build JDK 21 via JAVA_HOME, test JDK 8 via JAVA_TEST_HOME)
failed with "Unrecognized option: --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED"
on the JDK 8 test JVM.

Root cause: the flag was gated on PlatformUtils.testJvmMajorVersion() inside
ProfilerTestExtension.init{}, i.e. at Gradle *configuration* time. JAVA_TEST_HOME
is not resolvable then (that is precisely why the task executables are assigned in
doFirst, "so environment variables are read correctly"), so testJavaHome() fell back
to the build JDK (JAVA_HOME=21) and added a JDK-21-only flag that then aborted the
JDK-8 launcher.

Move the gate to task execution time: a carrierExportJvmArgs() helper evaluated in
both test doFirst blocks (glibc Test task and musl Exec task), where the real test
JVM (JAVA_TEST_HOME) is resolvable. When it is < 21 the flag is omitted and the
profiler degrades to thread-scoped storage (carrier tests skip) — safe.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The previous execution-time gate still put --add-exports on the JDK 8 musl test
JVM: PlatformUtils.testJvmMajorVersion() launches `$JAVA_TEST_HOME/bin/java
-version`, and in the split-JDK musl matrix that probe reported >= 21 even though
the resolved executable (and the JVM that then rejected the flag) was JDK 8.

Read the major version from the test JDK's `release` file (JAVA_VERSION=...)
instead of executing the launcher: a pure file read of the same JAVA_TEST_HOME the
executable is resolved from — deterministic, no subprocess, no exec-format/PATH
hazards. Returns 0 when it can't be determined (missing/old `release`), so the flag
is omitted and carrier tests skip — fail-safe, never an abort. Adds a lifecycle log
of the resolved testJavaHome + detected major to make the gate decision visible.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rkennke rkennke marked this pull request as ready for review July 1, 2026 12:31
@rkennke rkennke requested a review from a team as a code owner July 1, 2026 12:31
…info

The gate decision was logged at lifecycle level during stabilization to make the
musl/ibm CI behavior visible. The fix is proven (musl-8 green, ibm-8 flaky and
green on rerun), so drop it to info to avoid printing on every test invocation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread ddprof-lib/src/main/java/com/datadoghq/profiler/JavaProfiler.java
Comment thread ddprof-lib/src/main/java/com/datadoghq/profiler/OtelContextStorage.java Outdated
….* property, tidy create()

Addresses Jaroslav's review on PR #625:
- contextStorageMode() returns a typed ContextStorageMode (new public enum) instead
  of a String; the OtelContextStorage factory stays package-private.
- Rename the selector to ddprof.debug.context.storage.mode (ddprof.debug.*, like
  ddprof.debug.malloc_arena_max) to signal it is an internal knob, not supported config.
- create() parses the selector once into locals and documents that it runs once per
  JavaProfiler instance (the tlsContextStorage field initializer), not per thread —
  clarifying the review's per-thread concern.

Preserves the prior commits' work (Locale.ROOT parsing, getThreadContext caching note,
build-side --add-exports gating, removed motivation-only cpp test).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dd-octo-sts

dd-octo-sts Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Reliability & Chaos Results

All reliability & chaos checks passed Pipeline: https://gitlab.ddbuild.io/DataDog/java-profiler/-/pipelines/122165041

@jbachorik jbachorik left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This was a sneaky one but it is nicely fixed.
LGTM!

@rkennke rkennke merged commit 94686da into main Jul 2, 2026
105 checks passed
@rkennke rkennke deleted the rkennke/PROF-15271 branch July 2, 2026 08:29
@github-actions github-actions Bot added this to the 1.46.0 milestone Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants