TLS priming optimization #303

jbachorik · 2025-11-27T22:39:49Z

What does this PR do?:

Optimizes TLS (Thread-Local Storage) priming by introducing a lock-free bitset to track Java/JVM threads. This allows the profiler to skip sending TLS priming signals to Java threads (which are already initialized via JVMTI callbacks), reducing overhead by ~95% for typical Java workloads.

Key changes:

- New LockFreeBitset template class (lockFree.h): reusable double-hashing bitset for concurrent membership tracking
  - Uses two independent hash functions to minimize false positives (probability ≈ (n/m)² for n tracked threads and m bits per array)
  - Interleaved memory layout for L1 cache locality
  - Lock-free atomic operations safe for signal handlers
  - 32 KB memory footprint (fits entirely in L1 cache)
- Java thread tracking: registers Java threads in the bitset via JVMTI callbacks and skips TLS priming signals for them
- Thread watcher optimization: a 20 ms delay after thread creation allows JVMTI registration to complete before the bitset is checked
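
To make the double-hashing idea concrete, here is a minimal sketch of a bitset with two logical arrays stored interleaved and updated with atomic operations. It is not the code from lockFree.h: the class name, the splitmix64-style mixer, the seed constant, and the memory orderings are all assumptions chosen for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative sketch: the class name, hash mixer, and seed are assumptions,
// not the definitions from lockFree.h.
template <std::size_t BitsPerArray>
class DoubleHashBitsetSketch {
    static_assert(BitsPerArray >= 64 && (BitsPerArray & (BitsPerArray - 1)) == 0,
                  "BitsPerArray must be a power of two and at least 64");
    static constexpr std::size_t kWordsPerArray = BitsPerArray / 64;

    // Interleaved layout: [word1_0, word2_0, word1_1, word2_1, ...]
    std::atomic<std::uint64_t> words_[2 * kWordsPerArray];

    // splitmix64-style finalizer, used here as a stand-in hash.
    static std::uint64_t mix(std::uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    // Bit position of `key` in logical array 0 or 1 (a different seed per array).
    static std::size_t bitFor(std::uint64_t key, std::size_t array) {
        assert(array < 2);
        const std::uint64_t seed = (array == 0) ? 0 : 0x9e3779b97f4a7c15ULL;
        return static_cast<std::size_t>(mix(key ^ seed) % BitsPerArray);
    }

    // Index of the 64-bit word holding `bit` of the given logical array.
    static std::size_t wordIndex(std::size_t bit, std::size_t array) {
        return (bit / 64) * 2 + array;
    }

    static std::uint64_t maskFor(std::size_t bit) {
        return std::uint64_t{1} << (bit % 64);
    }

public:
    DoubleHashBitsetSketch() {
        for (auto& w : words_) w.store(0, std::memory_order_relaxed);
    }

    // Marks `key` as present by setting one bit in each logical array.
    void set(std::uint64_t key) {
        for (std::size_t a = 0; a < 2; ++a) {
            const std::size_t bit = bitFor(key, a);
            words_[wordIndex(bit, a)].fetch_or(maskFor(bit), std::memory_order_release);
        }
    }

    // Clears both bits for `key`. A key that shares a bit with `key`
    // will then read as absent.
    void clear(std::uint64_t key) {
        for (std::size_t a = 0; a < 2; ++a) {
            const std::size_t bit = bitFor(key, a);
            words_[wordIndex(bit, a)].fetch_and(~maskFor(bit), std::memory_order_release);
        }
    }

    // True only if both bits are set, so a false positive needs a collision
    // in both arrays at once.
    bool test(std::uint64_t key) const {
        for (std::size_t a = 0; a < 2; ++a) {
            const std::size_t bit = bitFor(key, a);
            const std::uint64_t word =
                words_[wordIndex(bit, a)].load(std::memory_order_acquire);
            if ((word & maskFor(bit)) == 0) return false;
        }
        return true;
    }
};
```

In this sketch a thread ID would be passed to set() from the registration path and to test() from the watcher before a priming signal is sent. The operations avoid locks and allocation, which is the property the PR relies on for signal-handler use, provided 64-bit atomics are lock-free on the target platform.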

Motivation:

Before this optimization, the TLS priming system sent signals to ALL threads (Java + native) when new threads were detected. Since Java threads are already initialized via JVMTI ThreadStart callbacks, these signals were wasteful.

Additionally, before TLS priming was implemented, native threads created after profiling started were completely invisible to the profiler. With this change:

- The vast majority of native threads (99.6-99.997%) are correctly primed and profiled
- Only a tiny fraction might be skipped due to false positives in the Java thread filter
- This is dramatically better than 100% of late-created native threads being invisible

Additional Notes:

- Configuration: TLS priming can be controlled via the DD_PROFILER_TLS_WATCHER environment variable or JVM system property (0 disables, 1 enables; enabled by default)
- False positive analysis: double-hashing reduces the false positive rate from p to p². With 16384 bits per array:
  - 100 threads → 0.003% false positive rate
  - 500 threads → 0.09% false positive rate
  - 1000 threads → 0.37% false positive rate
- Memory layout: interleaved array [word1_0, word2_0, word1_1, word2_1, ...] ensures both hash lookups access adjacent cache lines
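
The false positive figures can be reproduced with a small helper. This is a sketch of the arithmetic only: the function name is hypothetical, and it assumes the two hashes are independent and that about n of the m bits in each array are set when n threads are tracked, which slightly overestimates the occupancy when tracked threads collide with each other.

```cpp
#include <cassert>

// Hypothetical helper, not part of the PR: estimates the chance that an
// untracked thread ID is reported as present.
inline double falsePositiveRate(double trackedThreads, double bitsPerArray) {
    assert(bitsPerArray > 0);
    // Probability that a single hashed bit is already set in one array.
    const double p = trackedThreads / bitsPerArray;
    // Both arrays must collide for a lookup to be a false positive.
    return p * p;
}
```

With 16384 bits per array this gives roughly 0.0037% for 100 threads, 0.093% for 500 threads, and 0.37% for 1000 threads, in line with the figures listed above.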

How to test the change?:

- Unit tests: new gtest suites added:
  - test_lockFreeBitset.cpp: 15 tests for the generic bitset implementation
  - test_javaThreadBitset.cpp: 9 tests for Java thread tracking
- JMH benchmark: ThreadChurnBenchmark.java measures throughput impact under high thread churn
```
./gradlew :ddprof-stresstest:jmh -PjmhInclude="ThreadChurnBenchmark"
```
- All existing tests pass: 113 gtest tests, full test suite

For Datadog employees:

If this PR touches code that signs or publishes builds or packages, or handles
credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
This PR doesn't touch any of that.
JIRA: PROF-13170

Unsure? Have a question? Request a review!

pr-commenter · 2025-11-28T00:27:24Z

Benchmarks [x86_64 cpu]

Parameters

	Baseline	Candidate
config	baseline	candidate
ddprof	1.34.3	1.35.0-jb_tls_optim-SNAPSHOT

See matching parameters

	Baseline	Candidate
alloc	off	off
cpu	on	on
iterations	5	5
java	"11.0.28"	"11.0.28"
memleak	off	off
modes	cpu	cpu
wall	off	off

Summary

Found 0 performance improvements and 3 performance regressions! Performance is the same for 11 metrics, 24 unstable metrics.

scenario	Δ mean execution_time	Δ mean rss
scenario:renaissance:fj-kmeans	worse [+561.534ms; +694.466ms] or [+2.413%; +2.985%]	unstable [-247.733MB; +356.909MB] or [-23.498%; +33.853%]
scenario:renaissance:scala-kmeans	worse [+444.034ms; +1267.966ms] or [+1.930%; +5.511%]	unstable [-227.981MB; +341.255MB] or [-23.008%; +34.440%]
scenario:renaissance:gauss-mix	worse [+668.877ms; +907.123ms] or [+3.683%; +4.995%]	unstable [-400.040MB; +505.662MB] or [-33.459%; +42.293%]

pr-commenter · 2025-11-28T00:27:34Z

Benchmarks [x86_64 memleak,alloc]

Parameters

	Baseline	Candidate
config	baseline	candidate
ddprof	1.34.3	1.35.0-jb_tls_optim-SNAPSHOT

See matching parameters

	Baseline	Candidate
alloc	on	on
cpu	off	off
iterations	5	5
java	"11.0.28"	"11.0.28"
memleak	on	on
modes	memleak,alloc	memleak,alloc
wall	off	off

Summary

Found 0 performance improvements and 6 performance regressions! Performance is the same for 8 metrics, 24 unstable metrics.

scenario	Δ mean execution_time	Δ mean rss
scenario:renaissance:finagle-http	worse [+672.087ms; +1127.913ms] or [+2.542%; +4.265%]	unstable [-266.589MB; +374.100MB] or [-19.388%; +27.207%]
scenario:renaissance:future-genetic	worse [+379.760ms; +704.240ms] or [+2.329%; +4.319%]	unstable [-307.831MB; +422.794MB] or [-31.442%; +43.184%]
scenario:renaissance:fj-kmeans	worse [+402.827ms; +581.173ms] or [+1.721%; +2.483%]	unstable [-250.891MB; +353.262MB] or [-23.781%; +33.484%]
scenario:renaissance:naive-bayes	worse [+263.441ms; +404.559ms] or [+1.801%; +2.765%]	unstable [-476.428MB; +551.872MB] or [-49.032%; +56.796%]
scenario:renaissance:scala-kmeans	worse [+0.667s; +1.489s] or [+2.902%; +6.480%]	unstable [-227.016MB; +342.822MB] or [-22.903%; +34.586%]
scenario:renaissance:log-regression	worse [+843.043ms; +1128.957ms] or [+1.649%; +2.209%]	unstable [-145.723MB; +286.859MB] or [-8.680%; +17.087%]

pr-commenter · 2025-11-28T00:28:02Z

Benchmarks [x86_64 memleak]

Parameters

	Baseline	Candidate
config	baseline	candidate
ddprof	1.34.3	1.35.0-jb_tls_optim-SNAPSHOT

See matching parameters

	Baseline	Candidate
alloc	off	off
cpu	off	off
iterations	5	5
java	"11.0.28"	"11.0.28"
memleak	on	on
modes	memleak	memleak
wall	off	off

Summary

Found 0 performance improvements and 4 performance regressions! Performance is the same for 11 metrics, 23 unstable metrics.

scenario	Δ mean execution_time	Δ mean rss
scenario:renaissance:future-genetic	worse [+743.881ms; +980.119ms] or [+4.654%; +6.132%]	unstable [-307.015MB; +422.886MB] or [-31.384%; +43.229%]
scenario:renaissance:fj-kmeans	worse [+538.253ms; +645.747ms] or [+2.310%; +2.771%]	unstable [-245.474MB; +359.882MB] or [-23.283%; +34.135%]
scenario:renaissance:scala-kmeans	worse [+400.987ms; +987.013ms] or [+1.731%; +4.261%]	unstable [-229.473MB; +340.723MB] or [-23.109%; +34.312%]
scenario:renaissance:gauss-mix	worse [+691.838ms; +844.162ms] or [+3.804%; +4.641%]	unstable [-393.625MB; +508.576MB] or [-33.117%; +42.788%]

pr-commenter · 2025-11-28T00:28:45Z

Benchmarks [aarch64 alloc]

Parameters

	Baseline	Candidate
config	baseline	candidate
ddprof	1.34.3	1.35.0-jb_tls_optim-SNAPSHOT

See matching parameters

	Baseline	Candidate
alloc	on	on
cpu	off	off
iterations	5	5
java	"11.0.28"	"11.0.28"
memleak	off	off
modes	alloc	alloc
wall	off	off

Summary

Found 0 performance improvements and 2 performance regressions! Performance is the same for 15 metrics, 21 unstable metrics.

scenario	Δ mean execution_time	Δ mean rss
scenario:renaissance:future-genetic	worse [+500.577ms; +715.423ms] or [+3.335%; +4.766%]	unstable [-261.053MB; +526.533MB] or [-29.773%; +60.051%]
scenario:renaissance:naive-bayes	worse [+411.993ms; +944.007ms] or [+2.737%; +6.271%]	unstable [-282.696MB; +629.571MB] or [-28.349%; +63.133%]

pr-commenter · 2025-11-28T00:28:50Z

Benchmarks [aarch64 cpu]

Parameters

	Baseline	Candidate
config	baseline	candidate
ddprof	1.34.3	1.35.0-jb_tls_optim-SNAPSHOT

See matching parameters

	Baseline	Candidate
alloc	off	off
cpu	on	on
iterations	5	5
java	"11.0.28"	"11.0.28"
memleak	off	off
modes	cpu	cpu
wall	off	off

Summary

Found 0 performance improvements and 1 performance regressions! Performance is the same for 17 metrics, 20 unstable metrics.

scenario	Δ mean execution_time	Δ mean rss
scenario:renaissance:future-genetic	worse [+534.235ms; +745.765ms] or [+3.565%; +4.977%]	unstable [-244.253MB; +569.374MB] or [-28.673%; +66.839%]

pr-commenter · 2025-11-28T00:29:26Z

Benchmarks [aarch64 wall]

Parameters

	Baseline	Candidate
config	baseline	candidate
ddprof	1.34.3	1.35.0-jb_tls_optim-SNAPSHOT

See matching parameters

	Baseline	Candidate
alloc	off	off
cpu	off	off
iterations	5	5
java	"11.0.28"	"11.0.28"
memleak	off	off
modes	wall	wall
wall	on	on

Summary

Found 0 performance improvements and 2 performance regressions! Performance is the same for 14 metrics, 22 unstable metrics.

scenario	Δ mean execution_time	Δ mean rss
scenario:renaissance:future-genetic	worse [+378.635ms; +649.365ms] or [+2.510%; +4.305%]	unstable [-277.642MB; +485.866MB] or [-30.891%; +54.058%]
scenario:renaissance:dec-tree	worse [+478.667ms; +641.333ms] or [+1.589%; +2.130%]	unstable [-187.807MB; +427.317MB] or [-13.882%; +31.586%]

pr-commenter · 2025-11-28T00:29:58Z

Benchmarks [aarch64 cpu,wall]

Parameters

	Baseline	Candidate
config	baseline	candidate
ddprof	1.34.3	1.35.0-jb_tls_optim-SNAPSHOT

See matching parameters

	Baseline	Candidate
alloc	off	off
cpu	on	on
iterations	5	5
java	"11.0.28"	"11.0.28"
memleak	off	off
modes	cpu,wall	cpu,wall
wall	on	on

Summary

Found 0 performance improvements and 2 performance regressions! Performance is the same for 15 metrics, 21 unstable metrics.

scenario	Δ mean execution_time	Δ mean rss
scenario:renaissance:future-genetic	worse [+359.307ms; +648.693ms] or [+2.378%; +4.293%]	unstable [-275.754MB; +490.823MB] or [-30.598%; +54.463%]
scenario:renaissance:akka-uct	worse [+0.534s; +1.746s] or [+1.796%; +5.867%]	unstable [-166.876MB; +354.764MB] or [-13.697%; +29.118%]

pr-commenter · 2025-11-28T00:30:03Z

Benchmarks [x86_64 wall]

Parameters

	Baseline	Candidate
config	baseline	candidate
ddprof	1.34.3	1.35.0-jb_tls_optim-SNAPSHOT

See matching parameters

	Baseline	Candidate
alloc	off	off
cpu	off	off
iterations	5	5
java	"11.0.28"	"11.0.28"
memleak	off	off
modes	wall	wall
wall	on	on

Summary

Found 0 performance improvements and 3 performance regressions! Performance is the same for 12 metrics, 23 unstable metrics.

scenario	Δ mean execution_time	Δ mean rss
scenario:renaissance:fj-kmeans	worse [+494.858ms; +609.142ms] or [+2.125%; +2.615%]	unstable [-242.513MB; +356.522MB] or [-23.250%; +34.180%]
scenario:renaissance:scala-kmeans	worse [+521.421ms; +1058.579ms] or [+2.262%; +4.592%]	unstable [-222.671MB; +343.058MB] or [-22.658%; +34.908%]
scenario:renaissance:gauss-mix	worse [+827.330ms; +940.670ms] or [+4.590%; +5.218%]	unstable [-400.189MB; +503.990MB] or [-33.525%; +42.221%]

pr-commenter · 2025-11-28T00:30:04Z

Benchmarks [aarch64 cpu,wall,alloc,memleak]

Parameters

	Baseline	Candidate
config	baseline	candidate
ddprof	1.34.3	1.35.0-jb_tls_optim-SNAPSHOT

See matching parameters

	Baseline	Candidate
alloc	on	on
cpu	on	on
iterations	5	5
java	"11.0.28"	"11.0.28"
memleak	on	on
modes	cpu,wall,alloc,memleak	cpu,wall,alloc,memleak
wall	on	on

Summary

Found 0 performance improvements and 2 performance regressions! Performance is the same for 14 metrics, 22 unstable metrics.

scenario	Δ mean execution_time	Δ mean rss
scenario:renaissance:future-genetic	worse [+427.182ms; +824.818ms] or [+2.850%; +5.503%]	unstable [-261.216MB; +527.564MB] or [-29.750%; +60.084%]
scenario:renaissance:naive-bayes	worse [+321.143ms; +1202.857ms] or [+2.174%; +8.144%]	unstable [-290.387MB; +663.367MB] or [-30.172%; +68.925%]

pr-commenter · 2025-11-28T00:30:08Z

Benchmarks [aarch64 memleak]

Parameters

	Baseline	Candidate
config	baseline	candidate
ddprof	1.34.3	1.35.0-jb_tls_optim-SNAPSHOT

See matching parameters

	Baseline	Candidate
alloc	off	off
cpu	off	off
iterations	5	5
java	"11.0.28"	"11.0.28"
memleak	on	on
modes	memleak	memleak
wall	off	off

Summary

Found 0 performance improvements and 1 performance regressions! Performance is the same for 17 metrics, 20 unstable metrics.

scenario	Δ mean execution_time	Δ mean rss
scenario:renaissance:future-genetic	worse [+465.400ms; +782.600ms] or [+3.105%; +5.222%]	unstable [-264.729MB; +523.370MB] or [-30.125%; +59.557%]

pr-commenter · 2025-11-28T00:30:41Z

Benchmarks [aarch64 memleak,alloc]

Parameters

	Baseline	Candidate
config	baseline	candidate
ddprof	1.34.3	1.35.0-jb_tls_optim-SNAPSHOT

See matching parameters

	Baseline	Candidate
alloc	on	on
cpu	off	off
iterations	5	5
java	"11.0.28"	"11.0.28"
memleak	on	on
modes	memleak,alloc	memleak,alloc
wall	off	off

Summary

Found 0 performance improvements and 4 performance regressions! Performance is the same for 14 metrics, 20 unstable metrics.

scenario	Δ mean execution_time	Δ mean rss
scenario:renaissance:finagle-http	worse [+496.076ms; +1375.924ms] or [+1.551%; +4.302%]	unstable [-213.271MB; +332.203MB] or [-15.453%; +24.070%]
scenario:renaissance:future-genetic	worse [+534.232ms; +629.768ms] or [+3.552%; +4.187%]	unstable [-249.002MB; +565.070MB] or [-29.156%; +66.166%]
scenario:renaissance:chi-square	worse [+315.472ms; +1328.528ms] or [+2.010%; +8.466%]	unstable [-347.106MB; +507.162MB] or [-31.958%; +46.695%]
scenario:renaissance:fj-kmeans	worse [+441.014ms; +1422.986ms] or [+2.085%; +6.729%]	unstable [-240.888MB; +356.583MB] or [-23.255%; +34.424%]

pr-commenter · 2025-11-28T00:30:43Z

Benchmarks [x86_64 alloc]

Parameters

	Baseline	Candidate
config	baseline	candidate
ddprof	1.34.3	1.35.0-jb_tls_optim-SNAPSHOT

See matching parameters

	Baseline	Candidate
alloc	on	on
cpu	off	off
iterations	5	5
java	"11.0.28"	"11.0.28"
memleak	off	off
modes	alloc	alloc
wall	off	off

Summary

Found 0 performance improvements and 3 performance regressions! Performance is the same for 13 metrics, 22 unstable metrics.

scenario	Δ mean execution_time	Δ mean rss
scenario:renaissance:fj-kmeans	worse [+463.851ms; +516.149ms] or [+1.982%; +2.205%]	unstable [-244.001MB; +362.729MB] or [-23.116%; +34.363%]
scenario:renaissance:log-regression	worse [+0.887s; +1.197s] or [+1.738%; +2.346%]	unstable [-154.407MB; +283.670MB] or [-9.099%; +16.717%]
scenario:renaissance:gauss-mix	worse [+752.385ms; +919.615ms] or [+4.153%; +5.076%]	unstable [-399.088MB; +506.124MB] or [-33.431%; +42.397%]

pr-commenter · 2025-11-28T00:30:45Z

Benchmarks [x86_64 cpu,wall]

Parameters

	Baseline	Candidate
config	baseline	candidate
ddprof	1.34.3	1.35.0-jb_tls_optim-SNAPSHOT

See matching parameters

	Baseline	Candidate
alloc	off	off
cpu	on	on
iterations	5	5
java	"11.0.28"	"11.0.28"
memleak	off	off
modes	cpu,wall	cpu,wall
wall	on	on

Summary

Found 0 performance improvements and 4 performance regressions! Performance is the same for 10 metrics, 24 unstable metrics.

scenario	Δ mean execution_time	Δ mean rss
scenario:renaissance:page-rank	worse [+0.787s; +1.293s] or [+1.607%; +2.642%]	unstable [-115.321MB; +300.006MB] or [-7.894%; +20.535%]
scenario:renaissance:future-genetic	worse [+808.771ms; +1015.229ms] or [+5.074%; +6.370%]	unstable [-308.367MB; +419.197MB] or [-31.591%; +42.945%]
scenario:renaissance:fj-kmeans	worse [+519.702ms; +624.298ms] or [+2.228%; +2.677%]	unstable [-244.289MB; +361.022MB] or [-23.185%; +34.264%]
scenario:renaissance:gauss-mix	worse [+739.054ms; +868.946ms] or [+4.072%; +4.788%]	unstable [-391.916MB; +512.825MB] or [-32.918%; +43.074%]

pr-commenter · 2025-11-28T00:30:57Z

Benchmarks [x86_64 cpu,wall,alloc,memleak]

Parameters

	Baseline	Candidate
config	baseline	candidate
ddprof	1.34.3	1.35.0-jb_tls_optim-SNAPSHOT

See matching parameters

	Baseline	Candidate
alloc	on	on
cpu	on	on
iterations	5	5
java	"11.0.28"	"11.0.28"
memleak	on	on
modes	cpu,wall,alloc,memleak	cpu,wall,alloc,memleak
wall	on	on

Summary

Found 1 performance improvements and 2 performance regressions! Performance is the same for 13 metrics, 22 unstable metrics.

scenario	Δ mean execution_time	Δ mean rss
scenario:renaissance:page-rank	worse [+0.918s; +1.462s] or [+1.883%; +2.997%]	unstable [-130.219MB; +279.916MB] or [-8.921%; +19.176%]
scenario:renaissance:fj-kmeans	worse [+439.509ms; +524.491ms] or [+1.877%; +2.240%]	unstable [-255.530MB; +350.300MB] or [-24.107%; +33.047%]
scenario:renaissance:par-mnemonics	better [-2.784s; -0.392s] or [-10.681%; -1.502%]	unstable [-228.933MB; +308.269MB] or [-21.067%; +28.368%]

jbachorik force-pushed the jb/tls_optim branch from c753eda to f9175da on November 27, 2025 22:42

jbachorik mentioned this pull request Nov 28, 2025

Always prepare reports #302

Merged

jbachorik force-pushed the jb/tls_optim branch 5 times, most recently from ee450f4 to f2ae229 on November 28, 2025 11:05

jbachorik changed the title ~~[WIP] TLS priming optimization~~ TLS priming optimization Nov 28, 2025

jbachorik added 4 commits November 28, 2025 13:36

Prepare reports only if and when they are going to be uploaded

1ffcd63

Optimize the TLS priming with fast tracking of Java/JVM threads

95da72f

Add benchmark for the TLS priming mechanism

507bf09

Update claude instructions

cfe5cd0

jbachorik force-pushed the jb/tls_optim branch from a57ea0c to cfe5cd0 on November 28, 2025 12:36

Fix failing javadoc task

9e7a33a
