TLS priming optimization #303
Conversation
Benchmarks (CI bot summary by platform and profiling mode):

| Configuration | Improvements | Regressions | Unchanged metrics | Unstable metrics |
| --- | --- | --- | --- | --- |
| x86_64 cpu | 0 | 3 | 11 | 24 |
| x86_64 memleak,alloc | 0 | 6 | 8 | 24 |
| x86_64 memleak | 0 | 4 | 11 | 23 |
| aarch64 alloc | 0 | 2 | 15 | 21 |
| aarch64 cpu | 0 | 1 | 17 | 20 |
| aarch64 wall | 0 | 2 | 14 | 22 |
| aarch64 cpu,wall | 0 | 2 | 15 | 21 |
| x86_64 wall | 0 | 3 | 12 | 23 |
| aarch64 cpu,wall,alloc,memleak | 0 | 2 | 14 | 22 |
| aarch64 memleak | 0 | 1 | 17 | 20 |
| aarch64 memleak,alloc | 0 | 4 | 14 | 20 |
| x86_64 alloc | 0 | 3 | 13 | 22 |
| x86_64 cpu,wall | 0 | 4 | 10 | 24 |
| x86_64 cpu,wall,alloc,memleak | 1 | 2 | 13 | 22 |
What does this PR do?:
Optimizes TLS (Thread-Local Storage) priming by introducing a lock-free bitset to track Java/JVM threads. This allows the profiler to skip sending TLS priming signals to Java threads (which are already initialized via JVMTI callbacks), reducing overhead by ~95% for typical Java workloads.
Key changes:
- New `LockFreeBitset` template class (`lockFree.h`): a reusable double-hashing bitset for concurrent membership tracking
- Java thread tracking: Java threads are registered in the bitset via JVMTI callbacks, and TLS priming signals are skipped for them
- Thread watcher optimization: a 20 ms delay after thread creation allows JVMTI registration to complete before the bitset is checked
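Based on the description above, the double-hashing bitset might look roughly like this sketch. The class and method names, hash constants, and sizing are assumptions for illustration; the actual `lockFree.h` implementation may differ. It also demonstrates the interleaved word layout mentioned in the notes below, where the two words consulted for any one key sit next to each other in memory:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of a lock-free double-hashing membership bitset.
// Each key maps to one slot; membership is recorded as one bit in each of
// the slot's two words. Words are interleaved [word1_0, word2_0, word1_1,
// word2_1, ...] so both lookups for a key touch adjacent memory.
template <size_t NumSlots>
class LockFreeBitset {
    std::atomic<uint64_t> words_[2 * NumSlots] = {};  // zero-initialized

    // Multiplicative hashes; the constants are illustrative, not the PR's.
    static size_t slot(uint64_t key) {
        return ((key * 0x9E3779B97F4A7C15ULL) >> 32) % NumSlots;
    }
    static uint64_t bit1(uint64_t key) { return 1ULL << ((key * 0xC2B2AE3D27D4EB4FULL) % 64); }
    static uint64_t bit2(uint64_t key) { return 1ULL << ((key * 0xFF51AFD7ED558CCDULL) % 64); }

public:
    void add(uint64_t key) {
        size_t s = slot(key);
        words_[2 * s].fetch_or(bit1(key), std::memory_order_relaxed);
        words_[2 * s + 1].fetch_or(bit2(key), std::memory_order_relaxed);
    }

    // Double hashing: a false positive needs a collision in both hash
    // functions; false negatives cannot occur, so "skip the signal" is safe.
    bool contains(uint64_t key) const {
        size_t s = slot(key);
        return (words_[2 * s].load(std::memory_order_relaxed) & bit1(key)) &&
               (words_[2 * s + 1].load(std::memory_order_relaxed) & bit2(key));
    }
};
```

Since the set only ever grows (threads are never removed) and lookups tolerate rare false positives, plain `fetch_or`/`load` with relaxed ordering is sufficient; no locks or CAS loops are needed.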
Motivation:
Before this optimization, the TLS priming system sent signals to ALL threads (Java and native) when new threads were detected. Since Java threads are already initialized via the JVMTI `ThreadStart` callback, those signals were wasteful. Additionally, before TLS priming was implemented, native threads created after profiling started were completely invisible to the profiler. This change keeps those native threads visible while eliminating the redundant signals to Java threads.
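The watcher-side decision described above can be sketched as follows. The function and helper names are hypothetical, and the Java-thread set is stood in by a plain container rather than the PR's lock-free bitset; the real watcher loop and signaling mechanism in the PR likely differ:

```cpp
#include <chrono>
#include <cstdint>
#include <thread>
#include <unordered_set>

// Hypothetical stand-ins for the PR's bitset lookup and signal delivery.
static std::unordered_set<uint64_t> g_javaThreads;  // filled by JVMTI ThreadStart
static int g_signalsSent = 0;

static bool isKnownJavaThread(uint64_t tid) { return g_javaThreads.count(tid) != 0; }
static void sendTlsPrimingSignal(uint64_t)  { ++g_signalsSent; }

// Thread watcher: wait ~20ms so a Java thread's JVMTI ThreadStart callback
// has a chance to register it, then signal only unknown (native) threads.
void onNewThreadDetected(uint64_t tid) {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    if (isKnownJavaThread(tid)) {
        return;                     // already initialized via JVMTI; skip
    }
    sendTlsPrimingSignal(tid);      // native thread: prime its TLS
}
```

With most threads in a typical Java workload being Java threads, skipping the signal in the common path is what produces the ~95% overhead reduction cited above.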
Additional Notes:
- `DD_PROFILER_TLS_WATCHER` environment variable or JVM system property (set to `0` to disable, `1` to enable; enabled by default)
- Interleaved word layout `[word1_0, word2_0, word1_1, word2_1, ...]` ensures both hash lookups access adjacent cache lines

How to test the change?:
Unit tests: new gtest suites added:
- `test_lockFreeBitset.cpp`: 15 tests for the generic bitset implementation
- `test_javaThreadBitset.cpp`: 9 tests for Java thread tracking

JMH benchmark: `ThreadChurnBenchmark.java` measures throughput impact under high thread churn:
`./gradlew :ddprof-stresstest:jmh -PjmhInclude="ThreadChurnBenchmark"`

All existing tests pass: 113 gtest tests, full test suite.
For Datadog employees:
credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.

Unsure? Have a question? Request a review!