@jbachorik jbachorik commented Nov 27, 2025

What does this PR do?:

Optimizes TLS (Thread-Local Storage) priming by introducing a lock-free bitset to track Java/JVM threads. This allows the profiler to skip sending TLS priming signals to Java threads (which are already initialized via JVMTI callbacks), reducing overhead by ~95% for typical Java workloads.

Key changes:

  • New LockFreeBitset template class (lockFree.h): Reusable double-hashing bitset for concurrent membership tracking

    • Uses two independent hash functions to minimize false positives (probability ≈ (M/N)², where M is the number of registered threads and N is the number of bits per array)
    • Interleaved memory layout for L1 cache locality
    • Lock-free atomic operations safe for signal handlers
    • 32 KB memory footprint (fits entirely in L1 cache)
  • Java thread tracking: Register Java threads in the bitset via JVMTI callbacks, skip TLS priming signals for them

  • Thread watcher optimization: a 20 ms delay after thread creation gives JVMTI registration time to complete before the bitset is checked
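
A minimal sketch of how such a double-hashing bitset could look, under stated assumptions. The class name `DoubleHashBitsetSketch`, the two hash constants, and the memory orderings are illustrative choices, not the contents of lockFree.h. The sketch keeps the interleaved `[word1_0, word2_0, ...]` layout described above and uses only atomic read-modify-write operations.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative sketch only: the class name, hash constants, and memory
// orderings are assumptions, not the contents of lockFree.h.
template <std::size_t BitsPerArray>
class DoubleHashBitsetSketch {
  static_assert(BitsPerArray >= 64 && (BitsPerArray & (BitsPerArray - 1)) == 0,
                "BitsPerArray must be a power of two and at least 64");
  static constexpr std::size_t kWordsPerArray = BitsPerArray / 64;

  // Interleaved layout: [word1_0, word2_0, word1_1, word2_1, ...].
  std::atomic<std::uint64_t> _words[2 * kWordsPerArray];

  // Two multiplicative hashes with different odd constants (assumed).
  static std::size_t hash1(std::uint64_t key) {
    return static_cast<std::size_t>((key * 0x9E3779B97F4A7C15ULL) >> 32) & (BitsPerArray - 1);
  }
  static std::size_t hash2(std::uint64_t key) {
    return static_cast<std::size_t>((key * 0xC2B2AE3D27D4EB4FULL) >> 32) & (BitsPerArray - 1);
  }
  // Index of the word in logical array `which` (0 or 1) that holds `bit`.
  static std::size_t slot(std::size_t which, std::size_t bit) {
    return 2 * (bit / 64) + which;
  }
  static std::uint64_t mask(std::size_t bit) {
    return std::uint64_t{1} << (bit % 64);
  }

public:
  DoubleHashBitsetSketch() {
    for (auto& w : _words) w.store(0, std::memory_order_relaxed);
  }

  // Marks `key` as present by setting one bit in each logical array.
  void set(std::uint64_t key) {
    const std::size_t b1 = hash1(key), b2 = hash2(key);
    _words[slot(0, b1)].fetch_or(mask(b1), std::memory_order_release);
    _words[slot(1, b2)].fetch_or(mask(b2), std::memory_order_release);
  }

  // Clears both bits of `key`; this may also clear a bit shared with another key.
  void clear(std::uint64_t key) {
    const std::size_t b1 = hash1(key), b2 = hash2(key);
    _words[slot(0, b1)].fetch_and(~mask(b1), std::memory_order_release);
    _words[slot(1, b2)].fetch_and(~mask(b2), std::memory_order_release);
  }

  // True only if both bits are set, so a false positive needs two collisions.
  bool test(std::uint64_t key) const {
    const std::size_t b1 = hash1(key), b2 = hash2(key);
    return (_words[slot(0, b1)].load(std::memory_order_acquire) & mask(b1)) != 0 &&
           (_words[slot(1, b2)].load(std::memory_order_acquire) & mask(b2)) != 0;
  }
};

// Usage: register a thread id, query it, then unregister it.
inline void demoRoundTrip() {
  DoubleHashBitsetSketch<16384> javaThreads;
  assert(!javaThreads.test(4242));
  javaThreads.set(4242);
  assert(javaThreads.test(4242));
  javaThreads.clear(4242);
  assert(!javaThreads.test(4242));
}
```

In this sketch, `clear` can erase a bit that another registered key also maps to, which would turn that key into a false negative. How the actual implementation handles thread exit is not described here, so treat that part as an open design point.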

Motivation:

Before this optimization, the TLS priming system sent signals to ALL threads (Java + native) when new threads were detected. Since Java threads are already initialized via JVMTI ThreadStart callbacks, these signals were wasteful.

Additionally, before TLS priming was implemented, native threads created after profiling started were completely invisible to the profiler. With this change:

  • The vast majority of native threads (roughly 99.6% to 99.996%, depending on how many Java threads are tracked) are correctly primed and profiled
  • Only a tiny fraction might be skipped due to false positives in the Java thread filter
  • This is dramatically better than 100% of late-created native threads being invisible
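
The filtering decision described above can be sketched as follows. This is a hedged illustration, not the profiler's code: `PrimingHooks`, `onThreadDetected`, and both callbacks are hypothetical names, with the bitset lookup and the signal sender injected as functions so that the flow is easy to follow in isolation.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical hooks: in a real profiler these would be the bitset lookup
// and the code that sends the TLS priming signal.
struct PrimingHooks {
  std::function<bool(int)> isJavaThread;       // may report a false positive
  std::function<void(int)> sendPrimingSignal;  // primes TLS for a native thread
};

// Called when the watcher notices a new thread. Returns true if a priming
// signal was sent, false if the thread was treated as a Java thread.
inline bool onThreadDetected(int tid, const PrimingHooks& hooks,
                             std::chrono::milliseconds grace = std::chrono::milliseconds(20)) {
  assert(hooks.isJavaThread && hooks.sendPrimingSignal);
  // Give the JVMTI ThreadStart callback time to register the thread first.
  std::this_thread::sleep_for(grace);
  if (hooks.isJavaThread(tid)) {
    return false;  // already initialized via JVMTI, so no signal is needed
  }
  hooks.sendPrimingSignal(tid);
  return true;
}
```

A native thread is skipped only when the lookup reports a false positive, which is the small fraction mentioned in the list above.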

Additional Notes:

  • Configuration: TLS priming is controlled via the DD_PROFILER_TLS_WATCHER environment variable or a JVM system property (0 disables, 1 enables; enabled by default)
  • False positive analysis: Double-hashing reduces false positives from p to p². With 16384 bits per array:
    • 100 threads → 0.004% false positive rate
    • 500 threads → 0.09% false positive rate
    • 1000 threads → 0.37% false positive rate
  • Memory layout: Interleaved array [word1_0, word2_0, word1_1, word2_1, ...] ensures both hash lookups access adjacent cache lines
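
The false positive figures can be reproduced with a short calculation. The sketch below assumes uniformly distributed, independent hashes; the function names are hypothetical. `approxFalsePositiveRate` is the (M/N)² estimate quoted in this description, and `exactFalsePositiveRate` uses the per-array fill probability 1 − (1 − 1/N)^M, which is never larger than M/N.

```cpp
#include <cassert>
#include <cmath>

// (M/N)^2 estimate: M registered threads, N bits per array.
inline double approxFalsePositiveRate(double threads, double bitsPerArray) {
  assert(bitsPerArray > 0 && threads >= 0);
  const double p = threads / bitsPerArray;
  return p * p;
}

// Same idea, but with the exact chance that a given bit is set after
// `threads` uniform insertions into one array.
inline double exactFalsePositiveRate(double threads, double bitsPerArray) {
  assert(bitsPerArray > 0 && threads >= 0);
  const double p = 1.0 - std::pow(1.0 - 1.0 / bitsPerArray, threads);
  return p * p;
}
```

With 16384 bits per array, the estimate gives about 0.0037% for 100 threads, 0.093% for 500, and 0.37% for 1000; the exact form is slightly lower in each case.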

How to test the change?:

  1. Unit tests: New gtest suites added:

    • test_lockFreeBitset.cpp: 15 tests for the generic bitset implementation
    • test_javaThreadBitset.cpp: 9 tests for Java thread tracking
  2. JMH benchmark: ThreadChurnBenchmark.java measures throughput impact under high thread churn

    ./gradlew :ddprof-stresstest:jmh -PjmhInclude="ThreadChurnBenchmark"
  3. All existing tests pass: 113 gtest tests, full test suite

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: PROF-13170



pr-commenter bot commented Nov 28, 2025

Benchmarks [x86_64 cpu]

Parameters

| | Baseline | Candidate |
|---|---|---|
| config | baseline | candidate |
| ddprof | 1.34.3 | 1.35.0-jb_tls_optim-SNAPSHOT |

Matching parameters

| | Baseline | Candidate |
|---|---|---|
| alloc | off | off |
| cpu | on | on |
| iterations | 5 | 5 |
| java | "11.0.28" | "11.0.28" |
| memleak | off | off |
| modes | cpu | cpu |
| wall | off | off |

Summary

Found 0 performance improvements and 3 performance regressions! Performance is the same for 11 metrics, 24 unstable metrics.

| scenario | Δ mean execution_time | Δ mean rss |
|---|---|---|
| scenario:renaissance:fj-kmeans | worse [+561.534ms; +694.466ms] or [+2.413%; +2.985%] | unstable [-247.733MB; +356.909MB] or [-23.498%; +33.853%] |
| scenario:renaissance:scala-kmeans | worse [+444.034ms; +1267.966ms] or [+1.930%; +5.511%] | unstable [-227.981MB; +341.255MB] or [-23.008%; +34.440%] |
| scenario:renaissance:gauss-mix | worse [+668.877ms; +907.123ms] or [+3.683%; +4.995%] | unstable [-400.040MB; +505.662MB] or [-33.459%; +42.293%] |


pr-commenter bot commented Nov 28, 2025

Benchmarks [x86_64 memleak,alloc]

Parameters

| | Baseline | Candidate |
|---|---|---|
| config | baseline | candidate |
| ddprof | 1.34.3 | 1.35.0-jb_tls_optim-SNAPSHOT |

Matching parameters

| | Baseline | Candidate |
|---|---|---|
| alloc | on | on |
| cpu | off | off |
| iterations | 5 | 5 |
| java | "11.0.28" | "11.0.28" |
| memleak | on | on |
| modes | memleak,alloc | memleak,alloc |
| wall | off | off |

Summary

Found 0 performance improvements and 6 performance regressions! Performance is the same for 8 metrics, 24 unstable metrics.

| scenario | Δ mean execution_time | Δ mean rss |
|---|---|---|
| scenario:renaissance:finagle-http | worse [+672.087ms; +1127.913ms] or [+2.542%; +4.265%] | unstable [-266.589MB; +374.100MB] or [-19.388%; +27.207%] |
| scenario:renaissance:future-genetic | worse [+379.760ms; +704.240ms] or [+2.329%; +4.319%] | unstable [-307.831MB; +422.794MB] or [-31.442%; +43.184%] |
| scenario:renaissance:fj-kmeans | worse [+402.827ms; +581.173ms] or [+1.721%; +2.483%] | unstable [-250.891MB; +353.262MB] or [-23.781%; +33.484%] |
| scenario:renaissance:naive-bayes | worse [+263.441ms; +404.559ms] or [+1.801%; +2.765%] | unstable [-476.428MB; +551.872MB] or [-49.032%; +56.796%] |
| scenario:renaissance:scala-kmeans | worse [+0.667s; +1.489s] or [+2.902%; +6.480%] | unstable [-227.016MB; +342.822MB] or [-22.903%; +34.586%] |
| scenario:renaissance:log-regression | worse [+843.043ms; +1128.957ms] or [+1.649%; +2.209%] | unstable [-145.723MB; +286.859MB] or [-8.680%; +17.087%] |


pr-commenter bot commented Nov 28, 2025

Benchmarks [x86_64 memleak]

Parameters

| | Baseline | Candidate |
|---|---|---|
| config | baseline | candidate |
| ddprof | 1.34.3 | 1.35.0-jb_tls_optim-SNAPSHOT |

Matching parameters

| | Baseline | Candidate |
|---|---|---|
| alloc | off | off |
| cpu | off | off |
| iterations | 5 | 5 |
| java | "11.0.28" | "11.0.28" |
| memleak | on | on |
| modes | memleak | memleak |
| wall | off | off |

Summary

Found 0 performance improvements and 4 performance regressions! Performance is the same for 11 metrics, 23 unstable metrics.

| scenario | Δ mean execution_time | Δ mean rss |
|---|---|---|
| scenario:renaissance:future-genetic | worse [+743.881ms; +980.119ms] or [+4.654%; +6.132%] | unstable [-307.015MB; +422.886MB] or [-31.384%; +43.229%] |
| scenario:renaissance:fj-kmeans | worse [+538.253ms; +645.747ms] or [+2.310%; +2.771%] | unstable [-245.474MB; +359.882MB] or [-23.283%; +34.135%] |
| scenario:renaissance:scala-kmeans | worse [+400.987ms; +987.013ms] or [+1.731%; +4.261%] | unstable [-229.473MB; +340.723MB] or [-23.109%; +34.312%] |
| scenario:renaissance:gauss-mix | worse [+691.838ms; +844.162ms] or [+3.804%; +4.641%] | unstable [-393.625MB; +508.576MB] or [-33.117%; +42.788%] |


pr-commenter bot commented Nov 28, 2025

Benchmarks [aarch64 alloc]

Parameters

| | Baseline | Candidate |
|---|---|---|
| config | baseline | candidate |
| ddprof | 1.34.3 | 1.35.0-jb_tls_optim-SNAPSHOT |

Matching parameters

| | Baseline | Candidate |
|---|---|---|
| alloc | on | on |
| cpu | off | off |
| iterations | 5 | 5 |
| java | "11.0.28" | "11.0.28" |
| memleak | off | off |
| modes | alloc | alloc |
| wall | off | off |

Summary

Found 0 performance improvements and 2 performance regressions! Performance is the same for 15 metrics, 21 unstable metrics.

| scenario | Δ mean execution_time | Δ mean rss |
|---|---|---|
| scenario:renaissance:future-genetic | worse [+500.577ms; +715.423ms] or [+3.335%; +4.766%] | unstable [-261.053MB; +526.533MB] or [-29.773%; +60.051%] |
| scenario:renaissance:naive-bayes | worse [+411.993ms; +944.007ms] or [+2.737%; +6.271%] | unstable [-282.696MB; +629.571MB] or [-28.349%; +63.133%] |


pr-commenter bot commented Nov 28, 2025

Benchmarks [aarch64 cpu]

Parameters

| | Baseline | Candidate |
|---|---|---|
| config | baseline | candidate |
| ddprof | 1.34.3 | 1.35.0-jb_tls_optim-SNAPSHOT |

Matching parameters

| | Baseline | Candidate |
|---|---|---|
| alloc | off | off |
| cpu | on | on |
| iterations | 5 | 5 |
| java | "11.0.28" | "11.0.28" |
| memleak | off | off |
| modes | cpu | cpu |
| wall | off | off |

Summary

Found 0 performance improvements and 1 performance regressions! Performance is the same for 17 metrics, 20 unstable metrics.

| scenario | Δ mean execution_time | Δ mean rss |
|---|---|---|
| scenario:renaissance:future-genetic | worse [+534.235ms; +745.765ms] or [+3.565%; +4.977%] | unstable [-244.253MB; +569.374MB] or [-28.673%; +66.839%] |


pr-commenter bot commented Nov 28, 2025

Benchmarks [aarch64 wall]

Parameters

| | Baseline | Candidate |
|---|---|---|
| config | baseline | candidate |
| ddprof | 1.34.3 | 1.35.0-jb_tls_optim-SNAPSHOT |

Matching parameters

| | Baseline | Candidate |
|---|---|---|
| alloc | off | off |
| cpu | off | off |
| iterations | 5 | 5 |
| java | "11.0.28" | "11.0.28" |
| memleak | off | off |
| modes | wall | wall |
| wall | on | on |

Summary

Found 0 performance improvements and 2 performance regressions! Performance is the same for 14 metrics, 22 unstable metrics.

| scenario | Δ mean execution_time | Δ mean rss |
|---|---|---|
| scenario:renaissance:future-genetic | worse [+378.635ms; +649.365ms] or [+2.510%; +4.305%] | unstable [-277.642MB; +485.866MB] or [-30.891%; +54.058%] |
| scenario:renaissance:dec-tree | worse [+478.667ms; +641.333ms] or [+1.589%; +2.130%] | unstable [-187.807MB; +427.317MB] or [-13.882%; +31.586%] |


pr-commenter bot commented Nov 28, 2025

Benchmarks [aarch64 cpu,wall]

Parameters

| | Baseline | Candidate |
|---|---|---|
| config | baseline | candidate |
| ddprof | 1.34.3 | 1.35.0-jb_tls_optim-SNAPSHOT |

Matching parameters

| | Baseline | Candidate |
|---|---|---|
| alloc | off | off |
| cpu | on | on |
| iterations | 5 | 5 |
| java | "11.0.28" | "11.0.28" |
| memleak | off | off |
| modes | cpu,wall | cpu,wall |
| wall | on | on |

Summary

Found 0 performance improvements and 2 performance regressions! Performance is the same for 15 metrics, 21 unstable metrics.

| scenario | Δ mean execution_time | Δ mean rss |
|---|---|---|
| scenario:renaissance:future-genetic | worse [+359.307ms; +648.693ms] or [+2.378%; +4.293%] | unstable [-275.754MB; +490.823MB] or [-30.598%; +54.463%] |
| scenario:renaissance:akka-uct | worse [+0.534s; +1.746s] or [+1.796%; +5.867%] | unstable [-166.876MB; +354.764MB] or [-13.697%; +29.118%] |


pr-commenter bot commented Nov 28, 2025

Benchmarks [x86_64 wall]

Parameters

| | Baseline | Candidate |
|---|---|---|
| config | baseline | candidate |
| ddprof | 1.34.3 | 1.35.0-jb_tls_optim-SNAPSHOT |

Matching parameters

| | Baseline | Candidate |
|---|---|---|
| alloc | off | off |
| cpu | off | off |
| iterations | 5 | 5 |
| java | "11.0.28" | "11.0.28" |
| memleak | off | off |
| modes | wall | wall |
| wall | on | on |

Summary

Found 0 performance improvements and 3 performance regressions! Performance is the same for 12 metrics, 23 unstable metrics.

| scenario | Δ mean execution_time | Δ mean rss |
|---|---|---|
| scenario:renaissance:fj-kmeans | worse [+494.858ms; +609.142ms] or [+2.125%; +2.615%] | unstable [-242.513MB; +356.522MB] or [-23.250%; +34.180%] |
| scenario:renaissance:scala-kmeans | worse [+521.421ms; +1058.579ms] or [+2.262%; +4.592%] | unstable [-222.671MB; +343.058MB] or [-22.658%; +34.908%] |
| scenario:renaissance:gauss-mix | worse [+827.330ms; +940.670ms] or [+4.590%; +5.218%] | unstable [-400.189MB; +503.990MB] or [-33.525%; +42.221%] |


pr-commenter bot commented Nov 28, 2025

Benchmarks [aarch64 cpu,wall,alloc,memleak]

Parameters

| | Baseline | Candidate |
|---|---|---|
| config | baseline | candidate |
| ddprof | 1.34.3 | 1.35.0-jb_tls_optim-SNAPSHOT |

Matching parameters

| | Baseline | Candidate |
|---|---|---|
| alloc | on | on |
| cpu | on | on |
| iterations | 5 | 5 |
| java | "11.0.28" | "11.0.28" |
| memleak | on | on |
| modes | cpu,wall,alloc,memleak | cpu,wall,alloc,memleak |
| wall | on | on |

Summary

Found 0 performance improvements and 2 performance regressions! Performance is the same for 14 metrics, 22 unstable metrics.

| scenario | Δ mean execution_time | Δ mean rss |
|---|---|---|
| scenario:renaissance:future-genetic | worse [+427.182ms; +824.818ms] or [+2.850%; +5.503%] | unstable [-261.216MB; +527.564MB] or [-29.750%; +60.084%] |
| scenario:renaissance:naive-bayes | worse [+321.143ms; +1202.857ms] or [+2.174%; +8.144%] | unstable [-290.387MB; +663.367MB] or [-30.172%; +68.925%] |


pr-commenter bot commented Nov 28, 2025

Benchmarks [aarch64 memleak]

Parameters

| | Baseline | Candidate |
|---|---|---|
| config | baseline | candidate |
| ddprof | 1.34.3 | 1.35.0-jb_tls_optim-SNAPSHOT |

Matching parameters

| | Baseline | Candidate |
|---|---|---|
| alloc | off | off |
| cpu | off | off |
| iterations | 5 | 5 |
| java | "11.0.28" | "11.0.28" |
| memleak | on | on |
| modes | memleak | memleak |
| wall | off | off |

Summary

Found 0 performance improvements and 1 performance regressions! Performance is the same for 17 metrics, 20 unstable metrics.

| scenario | Δ mean execution_time | Δ mean rss |
|---|---|---|
| scenario:renaissance:future-genetic | worse [+465.400ms; +782.600ms] or [+3.105%; +5.222%] | unstable [-264.729MB; +523.370MB] or [-30.125%; +59.557%] |


pr-commenter bot commented Nov 28, 2025

Benchmarks [aarch64 memleak,alloc]

Parameters

| | Baseline | Candidate |
|---|---|---|
| config | baseline | candidate |
| ddprof | 1.34.3 | 1.35.0-jb_tls_optim-SNAPSHOT |

Matching parameters

| | Baseline | Candidate |
|---|---|---|
| alloc | on | on |
| cpu | off | off |
| iterations | 5 | 5 |
| java | "11.0.28" | "11.0.28" |
| memleak | on | on |
| modes | memleak,alloc | memleak,alloc |
| wall | off | off |

Summary

Found 0 performance improvements and 4 performance regressions! Performance is the same for 14 metrics, 20 unstable metrics.

| scenario | Δ mean execution_time | Δ mean rss |
|---|---|---|
| scenario:renaissance:finagle-http | worse [+496.076ms; +1375.924ms] or [+1.551%; +4.302%] | unstable [-213.271MB; +332.203MB] or [-15.453%; +24.070%] |
| scenario:renaissance:future-genetic | worse [+534.232ms; +629.768ms] or [+3.552%; +4.187%] | unstable [-249.002MB; +565.070MB] or [-29.156%; +66.166%] |
| scenario:renaissance:chi-square | worse [+315.472ms; +1328.528ms] or [+2.010%; +8.466%] | unstable [-347.106MB; +507.162MB] or [-31.958%; +46.695%] |
| scenario:renaissance:fj-kmeans | worse [+441.014ms; +1422.986ms] or [+2.085%; +6.729%] | unstable [-240.888MB; +356.583MB] or [-23.255%; +34.424%] |


pr-commenter bot commented Nov 28, 2025

Benchmarks [x86_64 alloc]

Parameters

| | Baseline | Candidate |
|---|---|---|
| config | baseline | candidate |
| ddprof | 1.34.3 | 1.35.0-jb_tls_optim-SNAPSHOT |

Matching parameters

| | Baseline | Candidate |
|---|---|---|
| alloc | on | on |
| cpu | off | off |
| iterations | 5 | 5 |
| java | "11.0.28" | "11.0.28" |
| memleak | off | off |
| modes | alloc | alloc |
| wall | off | off |

Summary

Found 0 performance improvements and 3 performance regressions! Performance is the same for 13 metrics, 22 unstable metrics.

| scenario | Δ mean execution_time | Δ mean rss |
|---|---|---|
| scenario:renaissance:fj-kmeans | worse [+463.851ms; +516.149ms] or [+1.982%; +2.205%] | unstable [-244.001MB; +362.729MB] or [-23.116%; +34.363%] |
| scenario:renaissance:log-regression | worse [+0.887s; +1.197s] or [+1.738%; +2.346%] | unstable [-154.407MB; +283.670MB] or [-9.099%; +16.717%] |
| scenario:renaissance:gauss-mix | worse [+752.385ms; +919.615ms] or [+4.153%; +5.076%] | unstable [-399.088MB; +506.124MB] or [-33.431%; +42.397%] |


pr-commenter bot commented Nov 28, 2025

Benchmarks [x86_64 cpu,wall]

Parameters

| | Baseline | Candidate |
|---|---|---|
| config | baseline | candidate |
| ddprof | 1.34.3 | 1.35.0-jb_tls_optim-SNAPSHOT |

Matching parameters

| | Baseline | Candidate |
|---|---|---|
| alloc | off | off |
| cpu | on | on |
| iterations | 5 | 5 |
| java | "11.0.28" | "11.0.28" |
| memleak | off | off |
| modes | cpu,wall | cpu,wall |
| wall | on | on |

Summary

Found 0 performance improvements and 4 performance regressions! Performance is the same for 10 metrics, 24 unstable metrics.

| scenario | Δ mean execution_time | Δ mean rss |
|---|---|---|
| scenario:renaissance:page-rank | worse [+0.787s; +1.293s] or [+1.607%; +2.642%] | unstable [-115.321MB; +300.006MB] or [-7.894%; +20.535%] |
| scenario:renaissance:future-genetic | worse [+808.771ms; +1015.229ms] or [+5.074%; +6.370%] | unstable [-308.367MB; +419.197MB] or [-31.591%; +42.945%] |
| scenario:renaissance:fj-kmeans | worse [+519.702ms; +624.298ms] or [+2.228%; +2.677%] | unstable [-244.289MB; +361.022MB] or [-23.185%; +34.264%] |
| scenario:renaissance:gauss-mix | worse [+739.054ms; +868.946ms] or [+4.072%; +4.788%] | unstable [-391.916MB; +512.825MB] or [-32.918%; +43.074%] |


pr-commenter bot commented Nov 28, 2025

Benchmarks [x86_64 cpu,wall,alloc,memleak]

Parameters

| | Baseline | Candidate |
|---|---|---|
| config | baseline | candidate |
| ddprof | 1.34.3 | 1.35.0-jb_tls_optim-SNAPSHOT |

Matching parameters

| | Baseline | Candidate |
|---|---|---|
| alloc | on | on |
| cpu | on | on |
| iterations | 5 | 5 |
| java | "11.0.28" | "11.0.28" |
| memleak | on | on |
| modes | cpu,wall,alloc,memleak | cpu,wall,alloc,memleak |
| wall | on | on |

Summary

Found 1 performance improvements and 2 performance regressions! Performance is the same for 13 metrics, 22 unstable metrics.

| scenario | Δ mean execution_time | Δ mean rss |
|---|---|---|
| scenario:renaissance:page-rank | worse [+0.918s; +1.462s] or [+1.883%; +2.997%] | unstable [-130.219MB; +279.916MB] or [-8.921%; +19.176%] |
| scenario:renaissance:fj-kmeans | worse [+439.509ms; +524.491ms] or [+1.877%; +2.240%] | unstable [-255.530MB; +350.300MB] or [-24.107%; +33.047%] |
| scenario:renaissance:par-mnemonics | better [-2.784s; -0.392s] or [-10.681%; -1.502%] | unstable [-228.933MB; +308.269MB] or [-21.067%; +28.368%] |

@jbachorik jbachorik mentioned this pull request Nov 28, 2025
@jbachorik jbachorik force-pushed the jb/tls_optim branch 5 times, most recently from ee450f4 to f2ae229 November 28, 2025 11:05
@jbachorik jbachorik changed the title from "[WIP] TLS priming optimization" to "TLS priming optimization" Nov 28, 2025