Frequent JVM crashes on 1.16.0 #5449

Closed
martin-tarjanyi opened this issue Jun 22, 2023 · 26 comments

@martin-tarjanyi

We recently introduced the DataDog JVM agent for our applications. We used 1.15.3 initially and then upgraded to 1.16.0. With both versions we are seeing frequent JVM crashes. We get the following message during shutdown:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fc3e4316cdd, pid=58, tid=205
#
# JRE version: OpenJDK Runtime Environment (17.0.7+7) (build 17.0.7+7-Ubuntu-0ubuntu120.04)
# Java VM: OpenJDK 64-Bit Server VM (17.0.7+7-Ubuntu-0ubuntu120.04, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libjavaProfiler13615730602053061346.so+0x31cdd]  FlightRecorder::recordTraceRoot(int, int, TraceRootEvent*)+0x3d
#
# Core dump will be written. Default location: /opt/site/app/bin/core.58
#
# JFR recording file will be written. Location: /opt/site/app/bin/hs_err_pid58.jfr
#
# An error report file with more information is saved as:
# /opt/site/app/bin/hs_err_pid58.log
#
# If you would like to submit a bug report, please visit:
#   Unknown
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
/opt/site/app/bin/app.jar: line 279:    58 Aborted                 (core dumped) "$javaexe" "${arguments[@]}"

Can you help us understand if this error is triggered by DataDog?

@martin-tarjanyi
Author

Actually, the above error seems to occur only with 1.16.0.

With 1.15.3 we were seeing this error:

java: /go/src/github.com/SL95iHeW/0/DataDog/apm-reliability/async-profiler-build/java-profiler/ddprof-lib/src/main/cpp/buffers.h:66: void Buffer::put(const char*, u32): Assertion `_offset + len < limit()' failed.

@richardstartin self-assigned this Jun 22, 2023
@richardstartin
Member

Hi @martin-tarjanyi thanks for the bug report and apologies for the crash. We will fix this and put out a patch release ASAP.

@richardstartin
Member

@martin-tarjanyi do you have any non-default profiler system properties (anything starting with -Ddd.profiling.*)?

@martin-tarjanyi
Author

@richardstartin no, we don't specify any extra settings.

@richardstartin
Member

@martin-tarjanyi 1.16.1 was released with a patch which we think will fix the issue, though because of the indirect nature of the bug (the frame in the crash report is just a symptom of something that went wrong earlier) it's difficult to say for sure that this is fixed. Please try 1.16.1 in a test environment and let us know whether it's fixed.
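
If it helps with testing, one way to try the patch without waiting for your dependency management to pick it up is to pin the agent jar to the exact release; the paths below are only illustrative, and the Maven Central coordinates are the usual distribution channel for dd-java-agent:

# download the pinned 1.16.1 release of dd-java-agent (illustrative path)
curl -Lo /opt/dd/dd-java-agent-1.16.1.jar \
  https://repo1.maven.org/maven2/com/datadoghq/dd-java-agent/1.16.1/dd-java-agent-1.16.1.jar

# start the application with the pinned agent
java -javaagent:/opt/dd/dd-java-agent-1.16.1.jar -jar app.jar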

@martin-tarjanyi
Author

Unfortunately, as of now we have no way to reproduce this in a non-prod environment. For now we are downgrading to 1.14.0 to be on the safe side; if that doesn't work out we will try the new patch version. Thanks for the quick response. I'll let you know about our findings, if any.

@Stephan202

We're seeing segfaults with version 1.16.0 and version 1.16.2 when running with -Ddd.profiling.enabled=true -Ddd.profiling.jfr-template-override-file=minimal -Ddd.logs.injection=false. They stop when we set -Ddd.profiling.enabled=false, so profiling seems implicated indeed.
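
For reference, a minimal sketch of the launch command with these flags (agent path and jar name are placeholders):

java \
  -javaagent:/path/to/dd-java-agent.jar \
  -Ddd.profiling.enabled=true \
  -Ddd.profiling.jfr-template-override-file=minimal \
  -Ddd.logs.injection=false \
  -jar app.jar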

Looking at the last 12 segfaults that were logged, we:

  • See this variant 6 times:
    #  SIGSEGV (0xb) at pc=0x00007f58a25e91fe, pid=1, tid=32
    #
    # JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
    # Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
    # Problematic frame:
    # C  [libjavaProfiler3900364312052103352.so+0x3a1fe]  Recording::writeSettings(Buffer*, Arguments&)+0x5e
    
  • See this variant 1 time:
    #  SIGSEGV (0xb) at pc=0x00007fac6a9e91fe, pid=1, tid=32
    #
    # JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
    # Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
    # Problematic frame:
    malloc_consolidate(): invalid chunk size
    # C  [libjavaProfiler13022814651276613134.so+0x3a1fe]
    
  • See this variant 3 times:
    #  SIGSEGV (0xb) at pc=0x00007fda63f1c602, pid=1, tid=32
    #
    # JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
    # Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
    # Problematic frame:
    # C  [libc.so.6+0x22602]  abort+0x1ee
    
  • See this variant 2 times:
    #  SIGSEGV (0xb) at pc=0x00007f518bc7e73c, pid=1, tid=32
    #
    # JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
    # Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
    # Problematic frame:
    # C  [libc.so.6+0x8773c]  cfree+0x1c
    

@richardstartin
Member

Hi @Stephan202, we are sorry for the inconvenience caused and thanks for a very actionable bug report.

This issue appears to have been introduced in 1.15.0 and was first reported in #5322, where going back to 1.14.0 resolved it. I wonder whether downgrading to 1.14.0 also prevents further segfaults in your services.

@richardstartin
Member

We still don't understand the cause of these segfaults and can't reproduce them in any of our internal services or test environments, but 1.16.3 has been released with more defensive checks for code paths introduced in 1.15.0. We apologise for the continued inconvenience and hope this prevents further segfaults.

@Stephan202

Hey @richardstartin! When I downgrade to 1.14.0 the segfaults indeed go away. Will now let things run with 1.16.3 for a while and report back later. 👍

@richardstartin
Member

@Stephan202 thanks for your efforts and reporting back.

For transparency's sake: since GA, this is the second time crashes have been reported related to the serialisation of JFR events in our native profiler after adding new event types, and preventing it from crashing the profiled process again is our top priority in the short term. We have a change in the pipeline that alters the behaviour on buffer overflow: rather than risk crashing the process or writing to arbitrary memory locations, the recording is truncated (which we have metrics for, so we can react to it). In the longer term, we will completely rewrite the event serialisation to prioritise safety.

@Stephan202

Really appreciate the transparency @richardstartin! 💪 With version 1.16.3 we have seen two segfaults so far; it's hard to say whether the frequency is lower than before.

In both cases they were of this form:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f016e9e8ca1, pid=1, tid=32
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libjavaProfiler15046550116111500334.so+0x39ca1]  Recording::writeSettings(Buffer*, Arguments&)+0x6

@richardstartin
Member

@Stephan202 if you have the hs_err file, we can see whether this is a MAP_ERR or ACC_ERR error, which would help form a hypothesis about what happened. In future you can control where the file is written with -XX:ErrorFile=/tmp/hs_err_pid%p.log. If you don't have the file already, please stick with 1.14.0; we will keep trying to reproduce the error and will update you when it's been fixed.
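
For example (the directory is only an illustration; %p expands to the process id, and any writable path works):

java \
  -XX:ErrorFile=/tmp/hs_err_pid%p.log \
  -javaagent:/path/to/dd-java-agent.jar \
  -jar app.jar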

@richardstartin
Member

We don't need the hs_err file; the cause is now understood.

Any users encountering this issue should stay on 1.14.0 until 1.17.0 is released, or set -Ddd.profiling.ddprof.enabled=false to fall back to the built-in JFR profiler until they are ready to upgrade to 1.17.0.
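
As a sketch, a launch that keeps profiling enabled but falls back to the built-in JFR engine could look like this (agent path and jar name are placeholders):

java \
  -javaagent:/path/to/dd-java-agent.jar \
  -Ddd.profiling.enabled=true \
  -Ddd.profiling.ddprof.enabled=false \
  -jar app.jar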

@richardstartin
Member

1.17.0 has been released and should resolve this issue; we reduced the problem to a reproducible test case which is fixed in 1.17.0. Please report back if the crashes continue.

@Stephan202

Tnx for the quick turnaround @richardstartin. We'll try it and report back in case of issues 👍

@Stephan202

So far no more segfaults! 🚀

@richardstartin
Member

@Stephan202 thanks so much for the useful diagnostic information and helping to confirm the fix. I'll close this and reopen the issue if it recurs on 1.17.0+.

@Stephan202

@richardstartin just a heads up: at a lower frequency we did see a few more segfaults in two applications (precisely the two applications for which we enabled profiling). For one of these applications we dropped the -Ddd.profiling.ddprof.liveheap.enabled=true -Ddd.profiling.directallocation.enabled=true flags (because their overhead was significant), and afterwards the segfaults only happened in the other application. So likely one of these beta features is implicated.

I'll redeploy the impacted application with a -XX:ErrorFile setting that should enable exfiltration of the error log file next time it happens. If that doesn't help we could (with patience) see the impact of running with only one of these flags, and then see whether only one of the two triggers the issue.

In the meantime, a few short logs. The vast majority of crashes are at these two frames:

# V  [libjvm.so+0xba3d1d]  Method::checked_resolve_jmethod_id(_jmethodID*)+0x1d
# V  [libjvm.so+0xba3d0f]  Method::checked_resolve_jmethod_id(_jmethodID*)+0xf

@richardstartin
Member

-Ddd.profiling.ddprof.liveheap.enabled=true is the causal factor; this relates to problems deep within the JVM and how concurrent class unloading interacts with JVMTI. My colleague @jbachorik is working on resolving this. I would discourage using the live heap profiler for the time being.

-Ddd.profiling.directallocation.enabled=true enables a profiler implemented in Java, which could not crash your JVM under any circumstances, regardless of profiler or JVM bugs. The direct allocation profiler is the alpha-quality output of a hackathon; if this feature is valuable to you, please channel that through your TAM to help prioritise reducing its overhead. Profiler overhead issues are often related to quite specific interactions with the application, so they are better resolved via our support, where we can look at the issue in context.
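
Concretely, a configuration that keeps profiling on but leaves both beta features off might look like this (paths are placeholders; since both features were only active because the flags were set to true, simply dropping the two flags should be equivalent):

java \
  -javaagent:/path/to/dd-java-agent.jar \
  -Ddd.profiling.enabled=true \
  -Ddd.profiling.ddprof.liveheap.enabled=false \
  -Ddd.profiling.directallocation.enabled=false \
  -jar app.jar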

@Stephan202

Tnx for the quick reply @richardstartin! I'll disable these flags for now also on the remaining application. W.r.t. the latter being alpha quality: perhaps the UI could have an ALPHA marker, next to the current BETA marker 😄


@Hexcles

Hexcles commented Jul 13, 2023

We are seeing similar crashes (at # V [libjvm.so+0xba610f] Method::checked_resolve_jmethod_id(_jmethodID*)+0xf) with dd-trace-agent 1.17.0.

@romaintonon

Hi,
We are facing the same issue in one of our applications since version 1.18.0 of the tracer: the JVM crashes after a while. We are using Temurin JDK 17 in an Alpine Docker image. Here are the logs, in case they help to solve this issue:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f80b404a17f, pid=1, tid=20
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.8+7 (17.0.8+7) (build 17.0.8+7)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.8+7 (17.0.8+7, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# C  [libjavaProfiler4010294222656270447.so+0xbb17f]  ITimer::signalHandler(int, siginfo_t*, void*)+0xbf
#
# Core dump will be written. Default location: /app/core
#
# JFR recording file will be written. Location: /app/hs_err_pid1.jfr
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
#

---------------  S U M M A R Y ------------

Command Line: -Xmx1g -Xms512m -XX:+UseZGC -javaagent:/dd-agent/dd-java-agent.jar -Ddd.profiling.enabled=true -Ddd.logs.injection=true -Ddd.agent.host=172.17.0.1 -Ddd.agent.port=8126 -Duser.timezone=Europe/Paris --add-opens=java.base/java.util.regex=ALL-UNNAMED --enable-preview -XX:+AlwaysPreTouch -Djava.security.egd=file:/dev/./urandom org.springframework.boot.loader.JarLauncher

Host: AMD Ryzen 5 3600X 6-Core Processor, 12 cores, 4G, Alpine Linux v3.18
Time: Thu Aug  3 09:09:38 2023 UTC elapsed time: 1636.086627 seconds (0d 0h 27m 16s)

---------------  T H R E A D  ---------------

Current thread (0x00007f80fbf2ad00):  Thread "RuntimeWorker#4" [stack: 0x00007f80fba1d000,0x00007f80fbb1daa8] [id=20]

Stack: [0x00007f80fba1d000,0x00007f80fbb1daa8],  sp=0x00007f80fbb1cfa0,  free space=1023k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libjavaProfiler4010294222656270447.so+0xbb17f]  ITimer::signalHandler(int, siginfo_t*, void*)+0xbf
C  [ld-musl-x86_64.so.1+0x495b7]

@richardstartin
Member

Hi @romaintonon, apologies for the inconvenience. This crash is an unrelated issue. As a temporary workaround, you can set -Ddd.profiling.ddprof.enabled=false, which will fall back to built-in JFR profiling while we reproduce and fix this issue.

@richardstartin
Member

@romaintonon we now understand the cause of the issue and have a fix ready. This was a regression introduced in 1.18.0, so instead of disabling the flag mentioned above you can go back to 1.17.0 without loss of functionality. Thanks for reporting this issue.

@romaintonon

@richardstartin thank you for your quick answer. We'll stay on version 1.17.0 then, until you release a version that fixes this issue.
