Frequent JVM crashes on 1.16.0 #5449
Comments
Actually, the above error seems to occur with 1.16.0 only. With 1.15.3 we were seeing this error:
Hi @martin-tarjanyi, thanks for the bug report and apologies for the crash. We will fix this and put out a patch release ASAP.
@martin-tarjanyi do you have any non-default profiler system properties (anything starting with
@richardstartin no, we don't specify any extra settings.
@martin-tarjanyi 1.16.1 was released with a patch which we think will fix the issue, though because of the indirect nature of the bug (the frame in the crash report is just a symptom of something that went wrong earlier) it's difficult to say for sure that this is fixed. Please try 1.16.1 in a test environment and let us know whether it's fixed.
Unfortunately, as of now we have no way to reproduce this in a non-prod environment. For now we are downgrading to 1.14.0 to be on the safe side. If that doesn't work out, we will try this new patch version. Thanks for the quick response anyway. I'll let you know about our findings, if any.
We're seeing segfaults with version 1.16.0 and version 1.16.2 when running with Looking at the last 12 segfaults that were logged, we:
Hi @Stephan202, we are sorry for the inconvenience caused, and thanks for a very actionable bug report. This issue appears to have been introduced in 1.15.0 and was first reported in #5322, where going back to 1.14.0 was said to fix the issue. Does downgrading to 1.14.0 prevent further segfaults in your services?
We still don't understand the cause of these segfaults and can't reproduce them in any of our internal services or test environments, but 1.16.3 has been released with more defensive checks for code paths introduced in 1.15.0. We apologise for the continued inconvenience and hope this prevents further segfaults.
Hey @richardstartin! When I downgrade to 1.14.0 the segfaults indeed go away. Will now let things run with 1.16.3 for a while and report back later. 👍
@Stephan202 thanks for your efforts and for reporting back. For transparency's sake: since GA, this is the second time crashes have been reported related to the serialisation of JFR events in our native profiler after adding new event types, and preventing it from crashing the profiled process again is our top priority in the short term. We have a change in the pipeline that alters the behaviour on buffer overflow: it would result in a truncated recording (which we have metrics for, so we can react to it) rather than risk crashing the process or writing to arbitrary memory locations. In the longer term, we will completely rewrite the event serialisation to prioritise safety.
Really appreciate the transparency @richardstartin! 💪 With version 1.16.3 we have seen two segfaults so far; hard to say whether the frequency is lower than before. In both cases they were of this form:
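The overflow behaviour described in the comment above (truncate the recording and count the loss, rather than write past the end of a buffer) can be illustrated with a minimal sketch. This is not the profiler's actual code, which is native; `BoundedEventWriter`, its length-prefixed layout, and the `droppedEvents` counter are all hypothetical names and choices made up for illustration.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical sketch: a fixed-size event buffer that drops events which do
 * not fit, instead of writing out of bounds. Not the real profiler code.
 */
final class BoundedEventWriter {
    private final ByteBuffer buffer;
    private long droppedEvents; // assumed metric that exposes truncation

    BoundedEventWriter(int capacityBytes) {
        this.buffer = ByteBuffer.allocate(capacityBytes);
    }

    /** Returns true if the event was written whole, false if it was dropped. */
    boolean write(byte[] event) {
        // A 4-byte length prefix plus the payload must fit in the space left.
        if (event.length > buffer.remaining() - Integer.BYTES) {
            droppedEvents++; // truncate the recording rather than overflow
            return false;
        }
        buffer.putInt(event.length);
        buffer.put(event);
        return true;
    }

    long droppedEvents() {
        return droppedEvents;
    }

    int bytesWritten() {
        return buffer.position();
    }
}
```

With a 16-byte buffer, a first 8-byte event fits (12 bytes including the prefix), while a second 8-byte event is dropped and counted, leaving the already written data intact.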
@Stephan202 if you have the hs_err file we can see whether this is a MAP_ERR or ACC_ERR error, which would help form a hypothesis about what's happened. If you haven't got this already, in future, you can get this with
We don't need the hs_err file; the cause is now understood. Any users encountering this issue should stay on 1.14.0 until 1.17.0 is released, or set
1.17.0 has been released, which should resolve this issue. Please report back if it does not, though we reduced the problem to a reproducible test case which was fixed in 1.17.0.
Tnx for the quick turnaround @richardstartin. We'll try it and report back in case of issues 👍
So far no more segfaults! 🚀
@Stephan202 thanks so much for the useful diagnostic information and for helping to confirm the fix. I'll close this and reopen the issue if it recurs on 1.17.0+.
@richardstartin just a heads up: at a lower frequency we did see a few more segfaults in two applications (precisely the two applications for which we enabled profiling). For one of these applications we dropped the I'll redeploy the impacted application with a In the mean time, a few short logs. The vast majority of crashes are at these two frames:
Tnx for the quick reply @richardstartin! I'll disable these flags for now also on the remaining application. W.r.t. the latter being alpha quality: perhaps the UI could have an
We are seeing similar crashes (at
Hi,
Hi @romaintonon, apologies for the inconvenience. This crash is an unrelated issue. As a temporary workaround, can you set
@romaintonon we understand the cause of the issue now and have a fix ready. This was a regression introduced in 1.18.0, so you could go back to 1.17.0 instead of disabling the flag mentioned above, without loss of functionality. Thanks for reporting this issue.
@richardstartin thank you for your quick answer. We will stay on version 1.17.0 then, until you release a version that fixes this issue.
We recently introduced the DataDog JVM agent for our applications. We used 1.15.3 initially and then upgraded to 1.16.0. With both versions we are seeing frequent JVM crashes. We get the following message during shutdown:
Can you help us understand if this error is triggered by DataDog?