Fix DDTraceId/DD64bTraceId class-initialization deadlock by bm1549 · Pull Request #11509 · DataDog/dd-trace-java

bm1549 · 2026-05-29T19:03:24Z

What Does This Do

Fixes a class-initialization deadlock between DDTraceId and DD64bTraceId that can hang trace creation at startup. DDTraceId.ZERO/ONE are now backed by a private sibling type instead of DD64bTraceId, so DDTraceId.<clinit> no longer initializes its own subclass. The public DDTraceId.ZERO/ONE fields are unchanged (no binary-incompatible change). A new value-based DDTraceId.isZero() replaces the == DDTraceId.ZERO sentinel checks.

Motivation

DD64bTraceId extends DDTraceId, so the JVM initializes DDTraceId first. But DDTraceId.<clinit> built its ZERO/ONE constants via DD64bTraceId.from(...), which initializes the subclass while the DDTraceId init lock is held. When the two classes are first touched concurrently from opposite ends, each thread ends up holding one class-init lock and waiting for the other:

dd-task-scheduler: the service-discovery task added in #9705 (Add support for service discovery using JNA) runs muteTracing() -> blackholeSpan() -> DDTraceId.ZERO
main: the application's first span runs IdGenerationStrategy.generateTraceId() -> DD64bTraceId.from()

Trace creation then hangs. This surfaced as recurring ~30s LogInjectionSmokeTest timeouts on master (traceCount=0, process.alive=true, RC polls received: ~135). The forked-process thread dumps added in #11400 confirmed the cycle, and it reproduces deterministically.
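
The lock cycle described above can be reproduced in miniature. The following is a minimal sketch, not the tracer's code: Base and Sub are hypothetical stand-ins for the pre-fix DDTraceId and DD64bTraceId, and the sleep inside the static initializer is an assumption added only to widen the race window so the interleaving is reliable.

```java
import java.util.concurrent.CountDownLatch;

public class ClinitCycleDemo {
  // Lives in its own class so that touching the latch initializes neither Base nor Sub.
  static final class Sync {
    static final CountDownLatch baseInitStarted = new CountDownLatch(1);
  }

  // Hypothetical stand-in for the pre-fix DDTraceId: its static initializer calls its own subclass.
  static class Base {
    static final Base ZERO;

    static {
      Sync.baseInitStarted.countDown();
      sleepQuietly(300); // widen the race window so the other thread can begin initializing Sub
      ZERO = Sub.from(0); // needs Sub initialized while Base is still being initialized
    }
  }

  // Hypothetical stand-in for DD64bTraceId: the JVM must initialize Base before Sub.
  static class Sub extends Base {
    final long id;

    private Sub(long id) {
      this.id = id;
    }

    static Sub from(long id) {
      return new Sub(id);
    }
  }

  static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  static void joinQuietly(Thread thread, long millis) {
    try {
      thread.join(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Returns true if both threads are still blocked in class initialization after the timeout. */
  static boolean reproducesDeadlock() {
    Thread fromBase = new Thread(() -> { Object unused = Base.ZERO; }, "from-base");
    Thread fromSub = new Thread(() -> {
      try {
        Sync.baseInitStarted.await(); // wait until the other thread is inside Base.<clinit>
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      Sub.from(1); // starts Sub initialization, which then waits for Base to finish
    }, "from-sub");
    // Daemon threads, so the stuck threads do not keep the JVM alive.
    fromBase.setDaemon(true);
    fromSub.setDaemon(true);
    fromSub.start();
    fromBase.start();
    joinQuietly(fromBase, 2000);
    joinQuietly(fromSub, 2000);
    return fromBase.isAlive() && fromSub.isAlive();
  }

  public static void main(String[] args) {
    System.out.println("deadlocked: " + reproducesDeadlock());
  }
}
```

In this sketch the from-base thread plays the role of dd-task-scheduler (it touches the parent's constant first) and from-sub plays the role of main (it enters through the subclass factory). Each ends up waiting for the class the other thread is still initializing, so neither join returns before the timeout.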

Additional Notes

Approach: break the cycle at its source. ZERO/ONE stay public static final DDTraceId fields (the surface deliberately restored in #5021, "[6to7] Restore public DDTraceId class API"), but are now instances of a private DDTraceId subtype, ConstantId, that is a sibling of DD64bTraceId. Because DDTraceId.<clinit> no longer references DD64bTraceId, the deadlock cannot happen regardless of timing.
Zero checks now use a value-based DDTraceId.isZero() instead of == DDTraceId.ZERO. The identity checks assumed every zero id was the single ZERO instance; isZero() recognizes a zero id of any concrete type, so the factories no longer special-case 0 and a zero parsed via the direct 64-bit factories (DD64bTraceId.fromHex in the XRay/Haystack codecs) is handled correctly. It also recognizes an all-zero 128-bit id, which == ZERO silently missed.
DDTraceIdClinitDeadlockForkedTest runs in a fresh JVM and initializes the two classes concurrently from opposite ends; it deadlocks without the fix and passes with it. TraceIdIsZeroTest and DDTraceIdConstantsTest cover isZero() and the constants across the DDTraceId subtypes.
The deadlock has been latent since #9705 (Add support for service discovery using JNA, Oct 2025) added the scheduled muteTracing() task; it began manifesting recently as startup timing shifted.
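
The shape of the approach in these notes can be sketched as follows. This is an illustration under stated assumptions rather than the code in this PR: TraceId, Id64, and Id128 are hypothetical stand-ins for DDTraceId and its concrete subtypes, and the accessor names are chosen for the example. Only the ConstantId name and the isZero() method come from the description above.

```java
public class SiblingConstantsDemo {
  // Hypothetical stand-in for DDTraceId after the fix.
  abstract static class TraceId {
    // Built from a private sibling type, so initializing TraceId never touches Id64.
    // ConstantId is private, so no outside thread can start initializing it first.
    static final TraceId ZERO = new ConstantId(0);
    static final TraceId ONE = new ConstantId(1);

    abstract long lowBits();

    abstract long highBits();

    // Value-based check: true for a zero id of any concrete type.
    boolean isZero() {
      return lowBits() == 0 && highBits() == 0;
    }

    private static final class ConstantId extends TraceId {
      private final long value;

      private ConstantId(long value) {
        this.value = value;
      }

      @Override
      long lowBits() {
        return value;
      }

      @Override
      long highBits() {
        return 0;
      }
    }
  }

  // Hypothetical stand-in for DD64bTraceId; the factory does not special-case 0.
  static final class Id64 extends TraceId {
    private final long id;

    private Id64(long id) {
      this.id = id;
    }

    static Id64 from(long id) {
      return new Id64(id);
    }

    @Override
    long lowBits() {
      return id;
    }

    @Override
    long highBits() {
      return 0;
    }
  }

  // Hypothetical stand-in for a 128-bit trace id.
  static final class Id128 extends TraceId {
    private final long high;
    private final long low;

    private Id128(long high, long low) {
      this.high = high;
      this.low = low;
    }

    static Id128 from(long high, long low) {
      return new Id128(high, low);
    }

    @Override
    long lowBits() {
      return low;
    }

    @Override
    long highBits() {
      return high;
    }
  }

  public static void main(String[] args) {
    TraceId parsedZero = Id64.from(0);
    System.out.println(parsedZero == TraceId.ZERO); // false: a distinct instance
    System.out.println(parsedZero.isZero()); // true: same value
    System.out.println(Id128.from(0, 0).isZero()); // true
    System.out.println(TraceId.ONE.isZero()); // false
  }
}
```

The identity comparison fails for a zero id that did not come from the singleton, which is the case the value-based check is meant to cover. The parent's static initializer still creates instances of a subtype, but that subtype is reachable only from the parent itself, so there is no second entry point from which another thread could take the locks in the opposite order.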

Contributor Checklist

Title formatted per the contribution guidelines
type: and comp: labels assigned
No issue-linking keywords used
CODEOWNERS update not required (no file addition/migration/deletion)
Public documentation update not required (no new configuration or behavior)

Jira ticket: N/A

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56ea720eb8

dd-octo-sts · 2026-05-29T19:19:15Z

🟢 Java Benchmark SLOs — All performance SLOs passed

Suite	Status
Startup	🟢 pass
Load	🟢 pass
DaCapo	🟢 pass

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results

Scenario	Variant	Metric	Candidate	master	Δ (95% CI of mean)
dacapo:biojava:tracing	datadog	execution_time	14.50 s	14.59 s	[-1.9%; +0.7%] (no difference)
dacapo:tomcat:tracing	datadog	execution_time	1.65 ms	1.66 ms	[-2.0%; +0.8%] (no difference)
startup:insecure-bank:iast:Agent	datadog	execution_time	14.08 s	14.01 s	[-0.7%; +1.7%] (no difference)
startup:insecure-bank:tracing:Agent	datadog	execution_time	12.90 s	12.93 s	[-1.1%; +0.6%] (no difference)
webserver--insecure-bank--tracing--high_load	load	agg_http_req_duration_p50	828.93 µs	809.73 µs	[-1.4%; +6.2%] (no difference)
startup:petclinic:appsec:Agent	datadog	execution_time	16.57 s	15.53 s	[-2.4%; +15.7%] (unstable)
startup:petclinic:iast:Agent	datadog	execution_time	16.41 s	16.46 s	[-1.5%; +0.9%] (no difference)
startup:petclinic:profiling:Agent	datadog	execution_time	16.40 s	15.62 s	[-3.8%; +13.7%] (unstable)
startup:petclinic:tracing:Agent	datadog	execution_time	15.65 s	14.87 s	[-3.3%; +13.7%] (unstable)
webserver--spring-petclinic--tracing--high_load	load	agg_http_req_duration_p50	25.12 ms	25.75 ms	[-8.6%; +3.7%] (unstable)

Commit: ded2c7c7 · CI Pipeline · Benchmarking Platform UI

dougqh

Overall, it looks good to me.
But before merging, double-check that the Codex comment isn't a problem.

DD64bTraceId is a subclass of DDTraceId, so the JVM must initialize DDTraceId before DD64bTraceId. DDTraceId.<clinit> in turn initialized DD64bTraceId by building its ZERO/ONE constants via DD64bTraceId.from(), a circular initialization dependency. When the two classes were first touched concurrently from opposite ends -- the service-discovery task (muteTracing() -> blackholeSpan() -> DDTraceId.ZERO) racing the application's first span (IdGenerationStrategy.generateTraceId() -> DD64bTraceId.from()) -- each thread held one class-init lock and waited for the other, hanging trace creation. This surfaced as recurring 30s LogInjectionSmokeTest timeouts in CI (latent since #9705 added the scheduled muteTracing task).

Break the cycle at its source while keeping DDTraceId.ZERO/ONE as public fields (preserving the API restored in #5021): ZERO/ONE are now instances of a private DDTraceId subtype (a sibling of DD64bTraceId), so DDTraceId.<clinit> no longer references the subclass.

Replace the fragile "== DDTraceId.ZERO" identity checks with a value-based DDTraceId.isZero(). Those identity checks relied on every zero id being the single ZERO instance; isZero() recognizes a zero id of any concrete type, so the factories need not route 0 to the singleton and the propagation codecs no longer mishandle a zero parsed via the direct 64-bit factories.

Add a forked regression test that initializes the two classes concurrently from opposite ends (deadlocks without the fix), plus isZero() coverage across the DDTraceId subtypes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

pr-commenter · 2026-05-30T14:25:36Z

Debugger benchmarks

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
ci_job_date	1780150450	1780150796
end_time	2026-05-30T14:15:37	2026-05-30T14:21:22
git_branch	master	brian.marks/fix-ddtraceid-clinit-deadlock
git_commit_sha	`064fda9`	`ded2c7c`
start_time	2026-05-30T14:14:11	2026-05-30T14:19:57

See matching parameters

	Baseline	Candidate
ci_job_id	1726969699	1726969699
ci_pipeline_id	116055918	116055918
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
git_commit_date	1780109723	1780109723

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 9 metrics, 6 unstable metrics.

See unchanged results

scenario	Δ mean agg_http_req_duration_min	Δ mean agg_http_req_duration_p50	Δ mean agg_http_req_duration_p75	Δ mean agg_http_req_duration_p99	Δ mean throughput
scenario:noprobe	unstable [-21.542µs; +30.033µs] or [-7.329%; +10.218%]	unstable [-36.009µs; +38.467µs] or [-10.571%; +11.292%]	unstable [-46.994µs; +50.253µs] or [-13.168%; +14.081%]	unstable [-65.232µs; +141.764µs] or [-5.583%; +12.133%]	same
scenario:basic	unsure [-9.611µs; -1.405µs] or [-3.575%; -0.523%]	same	same	unstable [-200.000µs; +41.540µs] or [-18.139%; +3.767%]	unstable [-125.442op/s; +125.442op/s] or [-5.018%; +5.018%]
scenario:loop	same	same	same	same	same

Request duration reports for reports

[Gantt chart: reports - request duration [CI 0.99], with baseline and candidate sections for the noprobe, basic, and loop scenarios]

baseline results

Scenario	Request median duration [CI 0.99]
noprobe	340.653 µs [311.05 µs, 370.257 µs]
basic	298.082 µs [290.833 µs, 305.331 µs]
loop	8.981 ms [8.976 ms, 8.987 ms]

candidate results

Scenario	Request median duration [CI 0.99]
noprobe	341.882 µs [301.518 µs, 382.247 µs]
basic	294.321 µs [288.216 µs, 300.426 µs]
loop	8.982 ms [8.977 ms, 8.988 ms]

pr-commenter · 2026-05-30T14:40:58Z

Kafka / producer-benchmark

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
git_branch	master	brian.marks/fix-ddtraceid-clinit-deadlock
git_commit_date	1780097219	1780109723
git_commit_sha	`194ee63`	`ded2c7c`

See matching parameters

	Baseline	Candidate
ci_job_date	1780150961	1780150961
ci_job_id	1726969697	1726969697
ci_pipeline_id	116055918	116055918
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
jdkVersion	11.0.25	11.0.25
jmhVersion	1.36	1.36
jvm	/usr/lib/jvm/java-11-openjdk-amd64/bin/java	/usr/lib/jvm/java-11-openjdk-amd64/bin/java
jvmArgs	-Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/producer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant	-Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/producer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
vmName	OpenJDK 64-Bit Server VM	OpenJDK 64-Bit Server VM
vmVersion	11.0.25+9-post-Ubuntu-1ubuntu122.04	11.0.25+9-post-Ubuntu-1ubuntu122.04

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics.

See unchanged results

scenario	Δ mean throughput
scenario:not-instrumented/KafkaProduceBenchmark.benchProduce	same
scenario:only-tracing-dsm-disabled-benchmarks/KafkaProduceBenchmark.benchProduce	same
scenario:only-tracing-dsm-enabled-benchmarks/KafkaProduceBenchmark.benchProduce	same

pr-commenter · 2026-05-30T14:53:17Z

Kafka / consumer-benchmark

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
git_branch	master	brian.marks/fix-ddtraceid-clinit-deadlock
git_commit_date	1780097219	1780109723
git_commit_sha	`194ee63`	`ded2c7c`

See matching parameters

	Baseline	Candidate
ci_job_date	1780150992	1780150992
ci_job_id	1726969698	1726969698
ci_pipeline_id	116055918	116055918
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
jdkVersion	11.0.25	11.0.25
jmhVersion	1.36	1.36
jvm	/usr/lib/jvm/java-11-openjdk-amd64/bin/java	/usr/lib/jvm/java-11-openjdk-amd64/bin/java
jvmArgs	-Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/consumer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant	-Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/consumer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
vmName	OpenJDK 64-Bit Server VM	OpenJDK 64-Bit Server VM
vmVersion	11.0.25+9-post-Ubuntu-1ubuntu122.04	11.0.25+9-post-Ubuntu-1ubuntu122.04

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics.

See unchanged results

scenario	Δ mean throughput
scenario:not-instrumented/KafkaConsumerBenchmark.benchConsume	unsure [+2328.734op/s; +12483.668op/s] or [+0.799%; +4.286%]
scenario:only-tracing-dsm-disabled-benchmarks/KafkaConsumerBenchmark.benchConsume	same
scenario:only-tracing-dsm-enabled-benchmarks/KafkaConsumerBenchmark.benchConsume	same

bm1549 added the type: bug (Bug report and fix), comp: api (Tracer public API), and tag: ai generated (Largely based on code generated by an AI or LLM) labels May 29, 2026

bm1549 marked this pull request as ready for review May 29, 2026 19:13

bm1549 requested a review from a team as a code owner May 29, 2026 19:13

bm1549 requested a review from dougqh May 29, 2026 19:13

chatgpt-codex-connector Bot reviewed May 29, 2026

Comment thread dd-trace-api/src/main/java/datadog/trace/api/DD64bTraceId.java

bm1549 requested a review from a team as a code owner May 29, 2026 19:29

bm1549 requested review from mcculls and removed request for a team May 29, 2026 19:29

dougqh reviewed May 29, 2026

bm1549 force-pushed the brian.marks/fix-ddtraceid-clinit-deadlock branch from b04a0d1 to 0e15d6c May 30, 2026 02:46

bm1549 requested a review from a team as a code owner May 30, 2026 02:46

Merge branch 'master' into brian.marks/fix-ddtraceid-clinit-deadlock

ded2c7c

bm1549 wants to merge 2 commits into master from brian.marks/fix-ddtraceid-clinit-deadlock
