fix(context): scope OTEL ThreadContext storage to the carrier thread (PROF-15271)#625
Merged
Conversation
…(PROF-15271) The ThreadContext DirectByteBuffer is a window into the OTEP record embedded in the *carrier's* native ProfiledThread — the record the (carrier-bound) sampler reads. It was cached in a plain ThreadLocal, keying it by the *virtual* thread and pinning it to whichever carrier was mounted at first use. That is wrong once the vthread migrates (writes land on the old carrier, so a sampler on the new carrier sees stale/empty context) and unsafe once the old carrier's OS thread exits: the record is freed while the buffer keeps being written — a use-after-free that can corrupt JVM-owned native memory (observed as a crash in ThreadsSMRSupport::free_list). Introduce a context-storage mode. In CARRIER mode, storage is backed by jdk.internal.misc.CarrierThreadLocal (JDK 21+), whose get()/set()/remove() operate on the current carrier's map even from a mounted virtual thread, so a vthread always resolves to its current carrier's live record. Storage lifetime then matches the native record's lifetime, eliminating the dangling-buffer window. - OtelContextStorage: factory + Mode enum + kill-switch. The internal type is built reflectively and held as its ThreadLocal supertype, so calls dispatch virtually with no per-call reflection and no compile-time dependency (the lib is a Java 8 baseline). Degrades to a plain ThreadLocal — today's behavior — when the type is missing (older JDK) or inaccessible (export not granted), never failing hard. Selected via -Dddprof.context.storage.mode=auto|carrier|thread. - JavaProfiler: field uses the factory; a currentContext() get-or-init helper replaces ThreadLocal.withInitial (a reflectively-built CarrierThreadLocal cannot carry a supplier); the 8 context write sites route through it; contextStorageMode() added for diagnostics/tests. - Build: add --add-exports java.base/jdk.internal.misc=ALL-UNNAMED to the test JVM, gated on testJvmMajorVersion() >= 21 (the flag aborts a Java 8 JVM). Production grants the export from the agent (follow-up). Tests: - CarrierContextStorageTest: 2000 vthreads resolve to exactly one ThreadContext per carrier (far fewer than the vthread count). - OtelContextStorageTest: kill-switch forces plain ThreadLocal; auto selects carrier when available; explicit carrier falls back without throwing. - context_uaf_ut.cpp: ASan proof of the unguarded write-after-free primitive. - Existing OtelContextStorageModeTest unregressed. Follow-ups: export grant in dd-trace-java; reapply-on-mount / clear-on-unmount coherence (PROF / #11646), which this change makes safe. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
…available Silently falling back to thread-scoped storage on a JDK 21+ JVM re-exposes the exact virtual-thread context use-after-free this change removes. Make the policy explicit at the mode level: - carrier: carrier scoping is required — create() now throws if jdk.internal.misc.CarrierThreadLocal is not accessible, with an actionable message (add the --add-exports, or opt into mode=thread). This is the fail-fast callers get once the runtime export is guaranteed. - auto (default): still degrades so profiling loads even without the export (the export may be granted by an agent that loads after this library, and the fallback is safe for non-Loom apps), but now logs a loud WARN on JDK 21+ where the fallback is unsafe under virtual threads. Silent on JDK < 21 (expected). - thread: unchanged (legacy, silent). Default stays graceful because at init we only know the JVM is Loom-capable, not whether virtual threads route context here; failing hard by default would break profiler startup for every JDK 21+ deployment until the agent-side export grant lands. The tracer opts into mode=carrier once it guarantees the export. Tests: carrierModeThrowsWhenUnavailable (fail-fast) and carrierModeUsesCarrierWhenAvailable replace the previous graceful-fallback-on-carrier test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
CI Test ResultsRun: #28574331429 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled Summary: Total: 32 | Passed: 32 | Failed: 0 Updated: 2026-07-02 08:09:18 UTC |
Contributor
Benchmark ResultsPipeline: https://gitlab.ddbuild.io/DataDog/apm-reliability/benchmarking-platform/-/pipelines/122294518 Commit:
|
| Benchmark | JDK | Latest | Dev | Δ (dev vs latest) | Issues L/D |
|---|---|---|---|---|---|
| akka-uct | 21 | ✅ 10480 ms (7 iters) | ✅ 10112 ms (7 iters) | ≈ -3.5% (±19.6%) | — / — |
| akka-uct | 25 | ✅ 8867 ms (8 iters) | ✅ 8822 ms (8 iters) | ≈ -0.5% (±16.3%) | — / — |
| finagle-chirper | 21 | ✅ 6008 ms (11 iters) | ✅ 6049 ms (11 iters) | ≈ +0.7% (±45.4%) | |
| finagle-chirper | 25 | ✅ 5525 ms (12 iters) | ✅ 5397 ms (12 iters) | ≈ -2.3% (±40.4%) | |
| fj-kmeans | 21 | ✅ 2810 ms (22 iters) | ✅ 2818 ms (22 iters) | ≈ +0.3% (±4.2%) | — / — |
| fj-kmeans | 25 | ✅ 2827 ms (22 iters) | ✅ 2857 ms (22 iters) | ≈ +1.1% (±4.5%) | — / — |
| future-genetic | 21 | ✅ 2024 ms (31 iters) | ✅ 2246 ms (28 iters) | 🔴 +11% | — / — |
| future-genetic | 25 | ✅ 2098 ms (30 iters) | ✅ 2009 ms (31 iters) | ≈ -4.2% (±4.4%) | — / — |
| naive-bayes | 21 | ✅ 1266 ms (45 iters) | ✅ 1217 ms (47 iters) | ≈ -3.9% (±55.4%) | — / — |
| naive-bayes | 25 | ✅ 1013 ms (56 iters) | ✅ 926 ms (61 iters) | ≈ -8.6% (±54.1%) | — / — |
| reactors | 21 | ✅ 15177 ms (5 iters) | ✅ 16506 ms (5 iters) | ≈ +8.8% (±15.4%) | — / — |
| reactors | 25 | ✅ 18264 ms (5 iters) | ✅ 17246 ms (5 iters) | ≈ -5.6% (±7.2%) | — / — |
Internal counter details (ddprof)
ddprof internal counters, latest / dev (✅ = 0, · = unavailable):
| Benchmark | JDK | Dropped rec | Dropped jvmti | Dropped trace | Skipped WC | AGCT fail | Unwind fail |
|---|---|---|---|---|---|---|---|
| akka-uct | 21 | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ |
| akka-uct | 25 | ✅ / ✅ | ✅ / ✅ | 2 / 1 | 2182 / 2399 | ✅ / ✅ | ✅ / ✅ |
| finagle-chirper | 21 | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ |
| finagle-chirper | 25 | ✅ / ✅ | ✅ / ✅ | ✅ / 2 | 8692 / 8242 | ✅ / ✅ | ✅ / ✅ |
| fj-kmeans | 21 | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ |
| fj-kmeans | 25 | ✅ / ✅ | ✅ / ✅ | 2 / 2 | 1299 / 1269 | ✅ / ✅ | ✅ / ✅ |
| future-genetic | 21 | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ |
| future-genetic | 25 | ✅ / ✅ | ✅ / ✅ | 6 / 2 | 2911 / 2858 | ✅ / ✅ | ✅ / ✅ |
| naive-bayes | 21 | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ |
| naive-bayes | 25 | ✅ / ✅ | ✅ / ✅ | 4 / 4 | 3471 / 3462 | ✅ / ✅ | ✅ / ✅ |
| reactors | 21 | ✅ / ✅ | ✅ / ✅ | 1 / ✅ | 1511 / 1733 | ✅ / ✅ | ✅ / ✅ |
| reactors | 25 | ✅ / ✅ | ✅ / ✅ | ✅ / 1 | 1875 / 1712 | ✅ / ✅ | ✅ / ✅ |
…t, harden test & parsing - Remove ddprof-lib/src/test/cpp/context_uaf_ut.cpp: it proved the native write-after-free *primitive*, not this PR's Java carrier-scoping fix (its red->green flip referenced a hypothetical native guard). The fix mechanism is validated by CarrierContextStorageTest; a deterministic end-to-end UAF test isn't feasible, so the danger stays documented in the PR/commit rationale. - CarrierContextStorageTest: fix the stale "two touches with a park" comment to match the single-touch reality, and relax the per-carrier assertion from ==1 to >=1. The map is keyed by carrier *name*, so a retired-and-replaced ForkJoinPool worker reusing a slot name could legitimately expose a second context; the did-NOT-key-by-vthread guarantee is now carried by the aggregate bound (context count well below vthread count), which is name-stability-free. - OtelContextStorage: parse the mode property with toLowerCase(Locale.ROOT) so "CARRIER" is not mangled to "carrıer" under tr_TR (which would silently drop the fail-fast semantics). - JavaProfiler.getThreadContext(): document that the returned ThreadContext must not be cached across a possible carrier migration (unmount/remount) — the buffer targets the carrier mounted at call time; setContext* re-fetch per use. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The musl split-JDK CI (build JDK 21 via JAVA_HOME, test JDK 8 via JAVA_TEST_HOME)
failed with "Unrecognized option: --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED"
on the JDK 8 test JVM.
Root cause: the flag was gated on PlatformUtils.testJvmMajorVersion() inside
ProfilerTestExtension.init{}, i.e. at Gradle *configuration* time. JAVA_TEST_HOME
is not resolvable then (that is precisely why the task executables are assigned in
doFirst, "so environment variables are read correctly"), so testJavaHome() fell back
to the build JDK (JAVA_HOME=21) and added a JDK-21-only flag that then aborted the
JDK-8 launcher.
Move the gate to task execution time: a carrierExportJvmArgs() helper evaluated in
both test doFirst blocks (glibc Test task and musl Exec task), where the real test
JVM (JAVA_TEST_HOME) is resolvable. When it is < 21 the flag is omitted and the
profiler degrades to thread-scoped storage (carrier tests skip) — safe.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The previous execution-time gate still put --add-exports on the JDK 8 musl test JVM: PlatformUtils.testJvmMajorVersion() launches `$JAVA_TEST_HOME/bin/java -version`, and in the split-JDK musl matrix that probe reported >= 21 even though the resolved executable (and the JVM that then rejected the flag) was JDK 8. Read the major version from the test JDK's `release` file (JAVA_VERSION=...) instead of executing the launcher: a pure file read of the same JAVA_TEST_HOME the executable is resolved from — deterministic, no subprocess, no exec-format/PATH hazards. Returns 0 when it can't be determined (missing/old `release`), so the flag is omitted and carrier tests skip — fail-safe, never an abort. Adds a lifecycle log of the resolved testJavaHome + detected major to make the gate decision visible. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…info The gate decision was logged at lifecycle level during stabilization to make the musl/ibm CI behavior visible. The fix is proven (musl-8 green, ibm-8 flaky and green on rerun), so drop it to info to avoid printing on every test invocation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jbachorik
reviewed
Jul 1, 2026
….* property, tidy create() Addresses Jaroslav's review on PR #625: - contextStorageMode() returns a typed ContextStorageMode (new public enum) instead of a String; the OtelContextStorage factory stays package-private. - Rename the selector to ddprof.debug.context.storage.mode (ddprof.debug.*, like ddprof.debug.malloc_arena_max) to signal it is an internal knob, not supported config. - create() parses the selector once into locals and documents that it runs once per JavaProfiler instance (the tlsContextStorage field initializer), not per thread — clarifying the review's per-thread concern. Preserves the prior commits' work (Locale.ROOT parsing, getThreadContext caching note, build-side --add-exports gating, removed motivation-only cpp test). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Reliability & Chaos Results✅ All reliability & chaos checks passed Pipeline: https://gitlab.ddbuild.io/DataDog/java-profiler/-/pipelines/122165041 |
jbachorik
approved these changes
Jul 2, 2026
jbachorik
left a comment
Collaborator
There was a problem hiding this comment.
Thanks! This was a sneaky one but it is nicely fixed.
LGTM!
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?:
Scopes OTEL
ThreadContextstorage to the carrier thread when running under virtual threads, replacing the plainThreadLocalthat keyed it by the virtual thread. Introduced as a selectable mode (OtelContextStorage.Mode.CARRIER) backed byjdk.internal.misc.CarrierThreadLocal, with a kill-switch and a safe fallback to today's behavior.Motivation:
The
ThreadContext'sDirectByteBufferis a window into the OTEP record embedded in the carrier's nativeProfiledThread— the record the (carrier-bound) sampler reads. Caching it in a plainThreadLocalkeys it by the virtual thread and pins it to whichever carrier was mounted at first use. This is:freeKey/release()frees theProfiledThreadwhile the cached buffer keeps being written. That's a use-after-free into memory glibc may have handed back to the JVM, observed in the field as a SIGSEGV inThreadsSMRSupport::free_listunder a Loom workload on Corretto 26.CarrierThreadLocal.get()/set()/remove()operate on the current carrier's map even from a mounted virtual thread, so a vthread always resolves to its current carrier's live record. Storage lifetime then matches the native record's lifetime, structurally eliminating the dangling-buffer window, and writes land where the sampler reads.Additional Notes:
--release 8, so the internal type can't be named at compile time. It's built reflectively once and held as itsThreadLocalsupertype, so calls dispatch virtually with no per-call reflection and no compile-time dependency.-Dddprof.context.storage.mode=auto|carrier|thread(defaultauto):auto— carrier scoping when available, else fall back to plainThreadLocal. Fallback is silent on JDK < 21 (expected) and a loud WARN on JDK 21+ (where it's the unsafe pre-fix behavior).carrier— carrier scoping is required: init fails hard (throws) ifCarrierThreadLocalis inaccessible, rather than silently re-exposing the UAF. This is the fail-fast callers opt into once the runtime export is guaranteed.thread— force legacy thread-scoped storage (kill-switch).autodegrades instead of failing hard by default: at init we only know the JVM is Loom-capable, not whether virtual threads actually route context here, and the runtime export carrier scoping needs is granted by the agent — which may load after this library. Failing hard by default would break profiler startup for every JDK 21+ deployment (including non-Loom apps) until that grant lands. Soautostays safe-and-loud; the tracer opts intocarrieronce it guarantees the export.--add-exports java.base/jdk.internal.miscto the profiler loader (viaInstrumentation.redefineModule) and setsmode=carriersoCARRIERengages and fails fast in production.ThreadContextper carrier is safe: its only non-buffer mutable state is the attribute cache, which is value-validated (value.equals(attrCacheKeys[slot])) with process-global encodings, so cross-vthread sharing costs at most a cache miss, never wrong data.How to test the change?:
Automated (green on JDK 21 with carrier scoping active):
CarrierContextStorageTest— 2000 virtual threads resolve to ~oneThreadContextper carrier (distinct-context count ≪ vthread count), proving storage is not keyed by the virtual thread. Skips when carrier mode isn't active.OtelContextStorageTest— kill-switch forces plainThreadLocal;auto/carrierselect carrier when available;carrierthrows when unavailable (fail-fast, runs on JDK < 21 CI).OtelContextStorageModeTestpasses unchanged (platform-thread isolation preserved).The build adds
--add-exports java.base/jdk.internal.misc=ALL-UNNAMEDonly on JDK ≥ 21 (the flag aborts a Java 8 JVM). The version is checked against the actual test JVM at task-execution time, so the split-JDK musl matrix (build JDK 21, test JDK 8) resolves correctly.For Datadog employees:
🤖 Generated with Claude Code