Frequent JVM crashes on 1.16.0 #5449

Closed
martin-tarjanyi opened this issue Jun 22, 2023 · 26 comments

@martin-tarjanyi

We recently introduced the DataDog JVM agent for our applications. We used 1.15.3 initially and then upgraded to 1.16.0. With both versions we are seeing frequent JVM crashes. We get the following message during shutdown:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fc3e4316cdd, pid=58, tid=205
#
# JRE version: OpenJDK Runtime Environment (17.0.7+7) (build 17.0.7+7-Ubuntu-0ubuntu120.04)
# Java VM: OpenJDK 64-Bit Server VM (17.0.7+7-Ubuntu-0ubuntu120.04, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libjavaProfiler13615730602053061346.so+0x31cdd]  FlightRecorder::recordTraceRoot(int, int, TraceRootEvent*)+0x3d
#
# Core dump will be written. Default location: /opt/site/app/bin/core.58
#
# JFR recording file will be written. Location: /opt/site/app/bin/hs_err_pid58.jfr
#
# An error report file with more information is saved as:
# /opt/site/app/bin/hs_err_pid58.log
#
# If you would like to submit a bug report, please visit:
#   Unknown
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
/opt/site/app/bin/app.jar: line 279:    58 Aborted                 (core dumped) "$javaexe" "${arguments[@]}"

Can you help us understand if this error is triggered by DataDog?

@martin-tarjanyi
Author

Actually, the above error seems to occur only with 1.16.0.

With 1.15.3 we were seeing this error:

java: /go/src/github.com/SL95iHeW/0/DataDog/apm-reliability/async-profiler-build/java-profiler/ddprof-lib/src/main/cpp/buffers.h:66: void Buffer::put(const char*, u32): Assertion `_offset + len < limit()' failed.

@richardstartin self-assigned this Jun 22, 2023
@richardstartin
Member

Hi @martin-tarjanyi thanks for the bug report and apologies for the crash. We will fix this and put out a patch release ASAP.

@richardstartin
Member

@martin-tarjanyi do you have any non-default profiler system properties (anything starting with -Ddd.profiling.*)?

@martin-tarjanyi
Author

@richardstartin no, we don't specify any extra settings.

@richardstartin
Member

@martin-tarjanyi 1.16.1 was released with a patch which we think will fix the issue, though because of the indirect nature of the bug (the frame in the crash report is just a symptom of something that went wrong earlier) it's difficult to say for sure that this is fixed. Please try 1.16.1 in a test environment and let us know whether it's fixed.
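
If it helps with testing, one way to try the patch without waiting for your dependency management to pick it up is to pin the agent jar to the exact release; the paths below are only illustrative, and the Maven Central coordinates are the usual distribution channel for dd-java-agent:

# download the pinned 1.16.1 release of dd-java-agent (illustrative path)
curl -Lo /opt/dd/dd-java-agent-1.16.1.jar \
  https://repo1.maven.org/maven2/com/datadoghq/dd-java-agent/1.16.1/dd-java-agent-1.16.1.jar

# start the application with the pinned agent
java -javaagent:/opt/dd/dd-java-agent-1.16.1.jar -jar app.jar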

@martin-tarjanyi
Author

Unfortunately, as of now we have no way to reproduce this in a non-prod environment. For now we are downgrading to 1.14.0 to be on the safe side; if that doesn't work out we will try the new patch version. Thanks for the quick response. I'll let you know about our findings, if any.

@Stephan202

We're seeing segfaults with version 1.16.0 and version 1.16.2 when running with -Ddd.profiling.enabled=true -Ddd.profiling.jfr-template-override-file=minimal -Ddd.logs.injection=false. They stop when we set -Ddd.profiling.enabled=false, so profiling seems implicated indeed.
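
For reference, a minimal sketch of the launch command with these flags (agent path and jar name are placeholders):

java \
  -javaagent:/path/to/dd-java-agent.jar \
  -Ddd.profiling.enabled=true \
  -Ddd.profiling.jfr-template-override-file=minimal \
  -Ddd.logs.injection=false \
  -jar app.jar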

Looking at the last 12 segfaults that were logged, we:

  • See this variant 6 times:
    #  SIGSEGV (0xb) at pc=0x00007f58a25e91fe, pid=1, tid=32
    #
    # JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
    # Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
    # Problematic frame:
    # C  [libjavaProfiler3900364312052103352.so+0x3a1fe]  Recording::writeSettings(Buffer*, Arguments&)+0x5e
    
  • See this variant 1 time:
    #  SIGSEGV (0xb) at pc=0x00007fac6a9e91fe, pid=1, tid=32
    #
    # JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
    # Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
    # Problematic frame:
    malloc_consolidate(): invalid chunk size
    # C  [libjavaProfiler13022814651276613134.so+0x3a1fe]
    
  • See this variant 3 times:
    #  SIGSEGV (0xb) at pc=0x00007fda63f1c602, pid=1, tid=32
    #
    # JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
    # Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
    # Problematic frame:
    # C  [libc.so.6+0x22602]  abort+0x1ee
    
  • See this variant 2 times:
    #  SIGSEGV (0xb) at pc=0x00007f518bc7e73c, pid=1, tid=32
    #
    # JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
    # Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
    # Problematic frame:
    # C  [libc.so.6+0x8773c]  cfree+0x1c
    

@richardstartin
Member

Hi @Stephan202, we are sorry for the inconvenience caused and thanks for a very actionable bug report.

This issue appears to have been introduced in 1.15.0 and was first reported in #5322, where going back to 1.14.0 resolved it. I wonder whether downgrading to 1.14.0 also prevents further segfaults in your services.

@richardstartin
Member

We still don't understand the cause of these segfaults and can't reproduce them in any of our internal services or test environments, but 1.16.3 has been released with more defensive checks for code paths introduced in 1.15.0. We apologise for the continued inconvenience and hope this prevents further segfaults.

@Stephan202

Hey @richardstartin! When I downgrade to 1.14.0 the segfaults indeed go away. Will now let things run with 1.16.3 for a while and report back later. 👍

@richardstartin
Member

@Stephan202 thanks for your efforts and reporting back.

For transparency's sake: since GA, this is the second time crashes have been reported related to the serialisation of JFR events in our native profiler after adding new event types, and preventing it from crashing the profiled process again is our top priority in the short term. We have a change in the pipeline that alters the behaviour on buffer overflow: rather than risk crashing the process or writing to arbitrary memory locations, the recording is truncated (which we have metrics for, so we can react to it). In the longer term, we will completely rewrite the event serialisation to prioritise safety.

@Stephan202

Really appreciate the transparency @richardstartin! 💪 With version 1.16.3 we have seen two segfaults so far; it's hard to say whether the frequency is lower than before.

In both cases they were of this form:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f016e9e8ca1, pid=1, tid=32
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libjavaProfiler15046550116111500334.so+0x39ca1]  Recording::writeSettings(Buffer*, Arguments&)+0x6

@richardstartin
Member

@Stephan202 if you have the hs_err file, we can see whether this is a MAP_ERR or ACC_ERR error, which would help form a hypothesis about what happened. In future you can control where the file is written with -XX:ErrorFile=/tmp/hs_err_pid%p.log. If you don't have the file already, please stick with 1.14.0; we will keep trying to reproduce the error and will update you when it's been fixed.
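
For example (the directory is only an illustration; %p expands to the process id, and any writable path works):

java \
  -XX:ErrorFile=/tmp/hs_err_pid%p.log \
  -javaagent:/path/to/dd-java-agent.jar \
  -jar app.jar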

@richardstartin
Member

We don't need the hs_err file; the cause is now understood.

Any users encountering this issue should stay on 1.14.0 until 1.17.0 is released, or set -Ddd.profiling.ddprof.enabled=false to fall back to the built-in JFR profiler until they are ready to upgrade to 1.17.0.
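
As a sketch, a launch that keeps profiling enabled but falls back to the built-in JFR engine could look like this (agent path and jar name are placeholders):

java \
  -javaagent:/path/to/dd-java-agent.jar \
  -Ddd.profiling.enabled=true \
  -Ddd.profiling.ddprof.enabled=false \
  -jar app.jar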

@richardstartin
Member

1.17.0 has been released and should resolve this issue; we reduced the problem to a reproducible test case which is fixed in 1.17.0. Please report back if the crashes continue.

@Stephan202

Tnx for the quick turnaround @richardstartin. We'll try it and report back in case of issues 👍

@Stephan202

So far no more segfaults! 🚀

@richardstartin
Member

@Stephan202 thanks so much for the useful diagnostic information and helping to confirm the fix. I'll close this and reopen the issue if it recurs on 1.17.0+.

@Stephan202

@richardstartin just a heads up: at a lower frequency we did see a few more segfaults in two applications (precisely the two applications for which we enabled profiling). For one of these applications we dropped the -Ddd.profiling.ddprof.liveheap.enabled=true -Ddd.profiling.directallocation.enabled=true flags (because their overhead was significant), and afterwards the segfaults only happened in the other application. So likely one of these beta features is implicated.

I'll redeploy the impacted application with a -XX:ErrorFile setting that should enable exfiltration of the error log file next time it happens. If that doesn't help we could (with patience) see the impact of running with only one of these flags, and then see whether only one of the two triggers the issue.

In the meantime, a few short logs. The vast majority of crashes are at these two frames:

# V  [libjvm.so+0xba3d1d]  Method::checked_resolve_jmethod_id(_jmethodID*)+0x1d
# V  [libjvm.so+0xba3d0f]  Method::checked_resolve_jmethod_id(_jmethodID*)+0xf

@richardstartin
Member

-Ddd.profiling.ddprof.liveheap.enabled=true is the causal factor; this relates to problems deep within the JVM and how concurrent class unloading interacts with JVMTI. My colleague @jbachorik is working on resolving this. I would discourage using the live heap profiler for the time being.

-Ddd.profiling.directallocation.enabled=true enables a profiler implemented in Java, which could not crash your JVM under any circumstances, regardless of profiler or JVM bugs. The direct allocation profiler is the alpha-quality output of a hackathon; if this feature is valuable to you, please channel that through your TAM to help prioritise reducing its overhead. Profiler overhead issues are often related to quite specific interactions with the application, so they are better resolved via our support, where we can look at the issue in context.
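
Concretely, a configuration that keeps profiling on but leaves both beta features off might look like this (paths are placeholders; since both features were only active because the flags were set to true, simply dropping the two flags should be equivalent):

java \
  -javaagent:/path/to/dd-java-agent.jar \
  -Ddd.profiling.enabled=true \
  -Ddd.profiling.ddprof.liveheap.enabled=false \
  -Ddd.profiling.directallocation.enabled=false \
  -jar app.jar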

@Stephan202

Tnx for the quick reply @richardstartin! I'll disable these flags for now also on the remaining application. W.r.t. the latter being alpha quality: perhaps the UI could have an ALPHA marker, next to the current BETA marker 😄


@Hexcles

Hexcles commented Jul 13, 2023

We are seeing similar crashes (at # V [libjvm.so+0xba610f] Method::checked_resolve_jmethod_id(_jmethodID*)+0xf) with dd-trace-agent 1.17.0.

@romaintonon

Hi,
We are facing the same issue in one of our applications since version 1.18.0 of the tracer: the JVM crashes after a while. We are using Temurin JDK 17 in an Alpine Docker image. Here are the logs, in case they help to solve this issue:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f80b404a17f, pid=1, tid=20
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.8+7 (17.0.8+7) (build 17.0.8+7)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.8+7 (17.0.8+7, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# C  [libjavaProfiler4010294222656270447.so+0xbb17f]  ITimer::signalHandler(int, siginfo_t*, void*)+0xbf
#
# Core dump will be written. Default location: /app/core
#
# JFR recording file will be written. Location: /app/hs_err_pid1.jfr
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
#

---------------  S U M M A R Y ------------

Command Line: -Xmx1g -Xms512m -XX:+UseZGC -javaagent:/dd-agent/dd-java-agent.jar -Ddd.profiling.enabled=true -Ddd.logs.injection=true -Ddd.agent.host=172.17.0.1 -Ddd.agent.port=8126 -Duser.timezone=Europe/Paris --add-opens=java.base/java.util.regex=ALL-UNNAMED --enable-preview -XX:+AlwaysPreTouch -Djava.security.egd=file:/dev/./urandom org.springframework.boot.loader.JarLauncher

Host: AMD Ryzen 5 3600X 6-Core Processor, 12 cores, 4G, Alpine Linux v3.18
Time: Thu Aug  3 09:09:38 2023 UTC elapsed time: 1636.086627 seconds (0d 0h 27m 16s)

---------------  T H R E A D  ---------------

Current thread (0x00007f80fbf2ad00):  Thread "RuntimeWorker#4" [stack: 0x00007f80fba1d000,0x00007f80fbb1daa8] [id=20]

Stack: [0x00007f80fba1d000,0x00007f80fbb1daa8],  sp=0x00007f80fbb1cfa0,  free space=1023k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libjavaProfiler4010294222656270447.so+0xbb17f]  ITimer::signalHandler(int, siginfo_t*, void*)+0xbf
C  [ld-musl-x86_64.so.1+0x495b7]

@richardstartin
Member

Hi @romaintonon, apologies for the inconvenience. This crash is an unrelated issue. As a temporary workaround, you can set -Ddd.profiling.ddprof.enabled=false, which will fall back to built-in JFR profiling while we reproduce and fix this issue.

@richardstartin
Member

@romaintonon we now understand the cause of the issue and have a fix ready. This was a regression introduced in 1.18.0, so instead of disabling the flag mentioned above you can go back to 1.17.0 without loss of functionality. Thanks for reporting this issue.

@romaintonon

@richardstartin thank you for your quick answer. We'll stay on version 1.17.0 then, until you release a version that fixes this issue.
