Fix DDTraceId/DD64bTraceId class-initialization deadlock #11509

Open
bm1549 wants to merge 2 commits into master from brian.marks/fix-ddtraceid-clinit-deadlock

Conversation

@bm1549
Contributor

@bm1549 bm1549 commented May 29, 2026

What Does This Do

Fixes a class-initialization deadlock between DDTraceId and DD64bTraceId that can hang trace creation at startup. DDTraceId.ZERO/ONE are now backed by a private sibling type instead of DD64bTraceId, so DDTraceId.<clinit> no longer initializes its own subclass. The public DDTraceId.ZERO/ONE fields are unchanged (no binary-incompatible change). A new value-based DDTraceId.isZero() replaces the == DDTraceId.ZERO sentinel checks.

Motivation

DD64bTraceId extends DDTraceId, so the JVM initializes DDTraceId first. But DDTraceId.<clinit> built its ZERO/ONE constants via DD64bTraceId.from(...), which initializes the subclass while the DDTraceId init lock is held. When the two classes are first touched concurrently from opposite ends, each thread ends up holding one class-init lock and waiting for the other:

  • dd-task-scheduler: the service-discovery task added in #9705 (Add support for service discovery using JNA) runs muteTracing() -> blackholeSpan() -> DDTraceId.ZERO
  • main: the application's first span runs IdGenerationStrategy.generateTraceId() -> DD64bTraceId.from()

Trace creation then hangs. This surfaced as recurring ~30s LogInjectionSmokeTest timeouts on master (traceCount=0, process.alive=true, RC polls received: ~135). The forked-process thread dumps added in #11400 confirmed the cycle, and it reproduces deterministically.
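
The cycle described above can be sketched with two stand-in classes. This is a minimal illustration, not the tracer's code: Base plays the role of DDTraceId, Sub plays DD64bTraceId, and every name here (ClinitCycleDemo, raceDeadlocks, the 300 ms pause) is hypothetical. The pause only widens the race window so that the second thread has started initializing Sub before the first one asks for it; the outcome is therefore timing-based.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins: Base plays DDTraceId, Sub plays DD64bTraceId.
final class ClinitCycleDemo {
  static final CountDownLatch BASE_INIT_STARTED = new CountDownLatch(1);

  static class Base {
    static final Base ZERO;

    static {
      BASE_INIT_STARTED.countDown();
      pause(300); // widen the race window (assumption: enough for the other thread to start)
      ZERO = Sub.from(0); // needs Sub initialized while this thread is still initializing Base
    }

    final long id;

    Base(long id) {
      this.id = id;
    }
  }

  static class Sub extends Base {
    Sub(long id) {
      super(id);
    }

    static Sub from(long id) {
      return new Sub(id);
    }
  }

  static void pause(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Returns true if both threads are still stuck after the timeout. */
  static boolean raceDeadlocks() {
    // First thread enters through the superclass constant.
    Thread viaBase = new Thread(() -> System.identityHashCode(Base.ZERO), "via-base");
    // Second thread enters through the subclass factory, which must initialize Base first.
    Thread viaSub = new Thread(() -> Sub.from(1), "via-sub");
    viaBase.setDaemon(true); // daemon threads let the JVM exit even if they never finish
    viaSub.setDaemon(true);
    try {
      viaBase.start();
      BASE_INIT_STARTED.await(5, TimeUnit.SECONDS);
      viaSub.start();
      viaBase.join(1000);
      viaSub.join(1000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return viaBase.isAlive() && viaSub.isAlive();
  }

  public static void main(String[] args) {
    System.out.println("deadlocked: " + raceDeadlocks());
  }
}
```

In this sketch the via-base thread owns the initialization of Base and waits for Sub, while the via-sub thread owns the initialization of Sub and waits for its superclass Base, which mirrors the two stacks listed above.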

Additional Notes

  • Approach: break the cycle at its source. ZERO/ONE stay public static final DDTraceId fields (the surface deliberately restored in #5021, [6to7] Restore public DDTraceId class API), but are now instances of a private DDTraceId subtype, ConstantId, that is a sibling of DD64bTraceId. Because DDTraceId.<clinit> no longer references the subclass, the deadlock cannot happen regardless of timing.
  • Zero checks now use a value-based DDTraceId.isZero() instead of == DDTraceId.ZERO. The identity checks assumed every zero id was the single ZERO instance; isZero() recognizes a zero id of any concrete type, so the factories no longer special-case 0 and a zero parsed via the direct 64-bit factories (DD64bTraceId.fromHex in the XRay/Haystack codecs) is handled correctly. It also recognizes an all-zero 128-bit id, which == ZERO silently missed.
  • DDTraceIdClinitDeadlockForkedTest runs in a fresh JVM and initializes the two classes concurrently from opposite ends; it deadlocks without the fix and passes with it. TraceIdIsZeroTest and DDTraceIdConstantsTest cover isZero() and the constants across the DDTraceId subtypes.
  • The deadlock has been latent since #9705 (Add support for service discovery using JNA, Oct 2025) added the scheduled muteTracing() task; it began manifesting recently as startup timing shifted.
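
As a rough sketch of the approach in the notes above, the constants can be backed by a private nested sibling so that the superclass initializer never reaches the public subclass. The names below (SiblingConstantsDemo, TraceId, Id64, bothInitialize) are hypothetical stand-ins chosen for illustration; only ConstantId is a name taken from this description, and its body here is assumed rather than copied from the change.

```java
import java.util.concurrent.CountDownLatch;

final class SiblingConstantsDemo {
  // Hypothetical stand-in for DDTraceId.
  abstract static class TraceId {
    // Built from a private sibling of Id64, so this initializer never touches Id64.
    static final TraceId ZERO = new ConstantId(0);
    static final TraceId ONE = new ConstantId(1);

    abstract long toLong();

    private static final class ConstantId extends TraceId {
      private final long id;

      ConstantId(long id) {
        this.id = id;
      }

      @Override
      long toLong() {
        return id;
      }
    }
  }

  // Hypothetical stand-in for DD64bTraceId.
  static final class Id64 extends TraceId {
    private final long id;

    private Id64(long id) {
      this.id = id;
    }

    static Id64 from(long id) {
      return new Id64(id);
    }

    @Override
    long toLong() {
      return id;
    }
  }

  /** Touches the two classes from opposite ends at once; true if both threads finish. */
  static boolean bothInitialize() {
    CountDownLatch go = new CountDownLatch(1);
    Thread viaBase = new Thread(() -> { await(go); TraceId.ZERO.toLong(); }, "via-base");
    Thread viaSub = new Thread(() -> { await(go); Id64.from(1).toLong(); }, "via-sub");
    viaBase.setDaemon(true);
    viaSub.setDaemon(true);
    try {
      viaBase.start();
      viaSub.start();
      go.countDown(); // release both threads together
      viaBase.join(2000);
      viaSub.join(2000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return !viaBase.isAlive() && !viaSub.isAlive();
  }

  private static void await(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    System.out.println("both initialized: " + bothInitialize());
  }
}
```

Because ConstantId is private, only the TraceId initializer can be the first to touch it, and that thread already owns the TraceId initialization, so the nested request is recursive rather than a wait on another thread.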

Contributor Checklist

  • Title formatted per the contribution guidelines
  • type: and comp: labels assigned
  • No issue-linking keywords used
  • CODEOWNERS update not required (no file addition/migration/deletion)
  • Public documentation update not required (no new configuration or behavior)

Jira ticket: N/A

@bm1549 bm1549 added the type: bug (Bug report and fix), comp: api (Tracer public API), and tag: ai generated (Largely based on code generated by an AI or LLM) labels May 29, 2026
@bm1549 bm1549 marked this pull request as ready for review May 29, 2026 19:13
@bm1549 bm1549 requested a review from a team as a code owner May 29, 2026 19:13
@bm1549 bm1549 requested a review from dougqh May 29, 2026 19:13

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56ea720eb8

Comment thread dd-trace-api/src/main/java/datadog/trace/api/DD64bTraceId.java
@dd-octo-sts
Contributor

dd-octo-sts Bot commented May 29, 2026

🟢 Java Benchmark SLOs — All performance SLOs passed

| Suite | Status |
|---|---|
| Startup | 🟢 pass |
| Load | 🟢 pass |
| DaCapo | 🟢 pass |

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results
| Scenario | Variant | Metric | Candidate | master | Δ (95% CI of mean) |
|---|---|---|---|---|---|
| dacapo:biojava:tracing | datadog | execution_time | 14.50 s | 14.59 s | [-1.9%; +0.7%] (no difference) |
| dacapo:tomcat:tracing | datadog | execution_time | 1.65 ms | 1.66 ms | [-2.0%; +0.8%] (no difference) |
| startup:insecure-bank:iast:Agent | datadog | execution_time | 14.08 s | 14.01 s | [-0.7%; +1.7%] (no difference) |
| startup:insecure-bank:tracing:Agent | datadog | execution_time | 12.90 s | 12.93 s | [-1.1%; +0.6%] (no difference) |
| webserver--insecure-bank--tracing--high_load | load | agg_http_req_duration_p50 | 828.93 µs | 809.73 µs | [-1.4%; +6.2%] (no difference) |
| startup:petclinic:appsec:Agent | datadog | execution_time | 16.57 s | 15.53 s | [-2.4%; +15.7%] (unstable) |
| startup:petclinic:iast:Agent | datadog | execution_time | 16.41 s | 16.46 s | [-1.5%; +0.9%] (no difference) |
| startup:petclinic:profiling:Agent | datadog | execution_time | 16.40 s | 15.62 s | [-3.8%; +13.7%] (unstable) |
| startup:petclinic:tracing:Agent | datadog | execution_time | 15.65 s | 14.87 s | [-3.3%; +13.7%] (unstable) |
| webserver--spring-petclinic--tracing--high_load | load | agg_http_req_duration_p50 | 25.12 ms | 25.75 ms | [-8.6%; +3.7%] (unstable) |

Commit: ded2c7c7 · CI Pipeline · Benchmarking Platform UI

@bm1549 bm1549 requested a review from a team as a code owner May 29, 2026 19:29
@bm1549 bm1549 requested review from mcculls and removed request for a team May 29, 2026 19:29
Contributor

@dougqh dougqh left a comment

Overall, it looks good to me.
But before merging, double-check that the Codex comment isn't a problem.

DD64bTraceId is a subclass of DDTraceId, so the JVM must initialize
DDTraceId before DD64bTraceId. DDTraceId.<clinit> in turn initialized
DD64bTraceId by building its ZERO/ONE constants via DD64bTraceId.from(),
a circular initialization dependency. When the two classes were first
touched concurrently from opposite ends -- the service-discovery task
(muteTracing() -> blackholeSpan() -> DDTraceId.ZERO) racing the
application's first span (IdGenerationStrategy.generateTraceId() ->
DD64bTraceId.from()) -- each thread held one class-init lock and waited
for the other, hanging trace creation. This surfaced as recurring 30s
LogInjectionSmokeTest timeouts in CI (latent since #9705 added the
scheduled muteTracing task).

Break the cycle at its source while keeping DDTraceId.ZERO/ONE as public
fields (preserving the API restored in #5021): ZERO/ONE are now instances
of a private DDTraceId subtype (a sibling of DD64bTraceId), so
DDTraceId.<clinit> no longer references the subclass.

Replace the fragile "== DDTraceId.ZERO" identity checks with a
value-based DDTraceId.isZero(). Those identity checks relied on every
zero id being the single ZERO instance; isZero() recognizes a zero id of
any concrete type, so the factories need not route 0 to the singleton and
the propagation codecs no longer mishandle a zero parsed via the direct
64-bit factories.
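
The difference between the identity check and a value-based check can be illustrated with a small sketch. The types below (ZeroCheckDemo, TraceId, Id64, Id128) and the low()/high() accessors are hypothetical stand-ins, and the isZero() body is an assumed implementation for illustration, not the one in this change.

```java
final class ZeroCheckDemo {
  // Hypothetical stand-in for DDTraceId.
  abstract static class TraceId {
    static final TraceId ZERO = new Constant(0);

    abstract long low(); // low-order 64 bits

    abstract long high(); // high-order 64 bits; 0 for 64-bit ids

    // Value-based check: true for a zero id of any concrete type.
    boolean isZero() {
      return low() == 0 && high() == 0;
    }

    private static final class Constant extends TraceId {
      private final long id;

      Constant(long id) {
        this.id = id;
      }

      @Override
      long low() {
        return id;
      }

      @Override
      long high() {
        return 0;
      }
    }
  }

  // Hypothetical 64-bit id whose factory does not special-case 0.
  static final class Id64 extends TraceId {
    private final long id;

    private Id64(long id) {
      this.id = id;
    }

    static Id64 from(long id) {
      return new Id64(id);
    }

    @Override
    long low() {
      return id;
    }

    @Override
    long high() {
      return 0;
    }
  }

  // Hypothetical 128-bit id.
  static final class Id128 extends TraceId {
    private final long high;
    private final long low;

    private Id128(long high, long low) {
      this.high = high;
      this.low = low;
    }

    static Id128 from(long high, long low) {
      return new Id128(high, low);
    }

    @Override
    long low() {
      return low;
    }

    @Override
    long high() {
      return high;
    }
  }

  public static void main(String[] args) {
    TraceId parsed = Id64.from(0);
    System.out.println(parsed == TraceId.ZERO); // false: a distinct instance
    System.out.println(parsed.isZero()); // true: compared by value
    System.out.println(Id128.from(0, 0).isZero()); // true
    System.out.println(Id64.from(42).isZero()); // false
  }
}
```

With a value-based check, a factory is free to return a fresh instance for 0, which is what lets the factories stop routing 0 to the singleton.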

Add a forked regression test that initializes the two classes
concurrently from opposite ends (deadlocks without the fix), plus
isZero() coverage across the DDTraceId subtypes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bm1549 bm1549 force-pushed the brian.marks/fix-ddtraceid-clinit-deadlock branch from b04a0d1 to 0e15d6c Compare May 30, 2026 02:46
@bm1549 bm1549 requested a review from a team as a code owner May 30, 2026 02:46
@pr-commenter

pr-commenter Bot commented May 30, 2026

Debugger benchmarks

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
ci_job_date 1780150450 1780150796
end_time 2026-05-30T14:15:37 2026-05-30T14:21:22
git_branch master brian.marks/fix-ddtraceid-clinit-deadlock
git_commit_sha 064fda9 ded2c7c
start_time 2026-05-30T14:14:11 2026-05-30T14:19:57
See matching parameters
Baseline Candidate
ci_job_id 1726969699 1726969699
ci_pipeline_id 116055918 116055918
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
git_commit_date 1780109723 1780109723

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 9 metrics, 6 unstable metrics.

See unchanged results
| scenario | Δ mean agg_http_req_duration_min | Δ mean agg_http_req_duration_p50 | Δ mean agg_http_req_duration_p75 | Δ mean agg_http_req_duration_p99 | Δ mean throughput |
|---|---|---|---|---|---|
| scenario:noprobe | unstable [-21.542µs; +30.033µs] or [-7.329%; +10.218%] | unstable [-36.009µs; +38.467µs] or [-10.571%; +11.292%] | unstable [-46.994µs; +50.253µs] or [-13.168%; +14.081%] | unstable [-65.232µs; +141.764µs] or [-5.583%; +12.133%] | same |
| scenario:basic | unsure [-9.611µs; -1.405µs] or [-3.575%; -0.523%] | same | same | unstable [-200.000µs; +41.540µs] or [-18.139%; +3.767%] | unstable [-125.442op/s; +125.442op/s] or [-5.018%; +5.018%] |
| scenario:loop | same | same | same | same | same |
Request duration reports for reports
[Gantt chart: reports - request duration [CI 0.99], baseline vs. candidate, for the noprobe, basic, and loop scenarios]
  • baseline results

| Scenario | Request median duration [CI 0.99] |
|---|---|
| noprobe | 340.653 µs [311.05 µs, 370.257 µs] |
| basic | 298.082 µs [290.833 µs, 305.331 µs] |
| loop | 8.981 ms [8.976 ms, 8.987 ms] |

  • candidate results

| Scenario | Request median duration [CI 0.99] |
|---|---|
| noprobe | 341.882 µs [301.518 µs, 382.247 µs] |
| basic | 294.321 µs [288.216 µs, 300.426 µs] |
| loop | 8.982 ms [8.977 ms, 8.988 ms] |

@pr-commenter

pr-commenter Bot commented May 30, 2026

Kafka / producer-benchmark

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master brian.marks/fix-ddtraceid-clinit-deadlock
git_commit_date 1780097219 1780109723
git_commit_sha 194ee63 ded2c7c
See matching parameters
Baseline Candidate
ci_job_date 1780150961 1780150961
ci_job_id 1726969697 1726969697
ci_pipeline_id 116055918 116055918
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
jdkVersion 11.0.25 11.0.25
jmhVersion 1.36 1.36
jvm /usr/lib/jvm/java-11-openjdk-amd64/bin/java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
jvmArgs -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/producer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/producer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
vmName OpenJDK 64-Bit Server VM OpenJDK 64-Bit Server VM
vmVersion 11.0.25+9-post-Ubuntu-1ubuntu122.04 11.0.25+9-post-Ubuntu-1ubuntu122.04

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics.

See unchanged results
| scenario | Δ mean throughput |
|---|---|
| scenario:not-instrumented/KafkaProduceBenchmark.benchProduce | same |
| scenario:only-tracing-dsm-disabled-benchmarks/KafkaProduceBenchmark.benchProduce | same |
| scenario:only-tracing-dsm-enabled-benchmarks/KafkaProduceBenchmark.benchProduce | same |

@pr-commenter

pr-commenter Bot commented May 30, 2026

Kafka / consumer-benchmark

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master brian.marks/fix-ddtraceid-clinit-deadlock
git_commit_date 1780097219 1780109723
git_commit_sha 194ee63 ded2c7c
See matching parameters
Baseline Candidate
ci_job_date 1780150992 1780150992
ci_job_id 1726969698 1726969698
ci_pipeline_id 116055918 116055918
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
jdkVersion 11.0.25 11.0.25
jmhVersion 1.36 1.36
jvm /usr/lib/jvm/java-11-openjdk-amd64/bin/java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
jvmArgs -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/consumer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/consumer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
vmName OpenJDK 64-Bit Server VM OpenJDK 64-Bit Server VM
vmVersion 11.0.25+9-post-Ubuntu-1ubuntu122.04 11.0.25+9-post-Ubuntu-1ubuntu122.04

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics.

See unchanged results
| scenario | Δ mean throughput |
|---|---|
| scenario:not-instrumented/KafkaConsumerBenchmark.benchConsume | unsure [+2328.734op/s; +12483.668op/s] or [+0.799%; +4.286%] |
| scenario:only-tracing-dsm-disabled-benchmarks/KafkaConsumerBenchmark.benchConsume | same |
| scenario:only-tracing-dsm-enabled-benchmarks/KafkaConsumerBenchmark.benchConsume | same |

Labels

comp: api (Tracer public API), tag: ai generated (Largely based on code generated by an AI or LLM), type: bug (Bug report and fix)

2 participants