[BUG] Possible SIGSEGV from ddtrace or ddappsec #2030
Comments
Nick, thanks for the report. I tried to do a simple reproduction with no luck (shutdown callback auto-loads a class). Could you grab the |
Thanks @morrisonlevi. Most of the crashes were occurring with WordPress sites, if that helps. No luck with zbacktrace, unfortunately. I'm not sure whether that's because I'm not attached to a live gdb process: I'm using apport-retrace to run gdb over a crash dump. Below is what I'm doing, with the noise trimmed out.
|
Hmm, I haven't encountered that one before. Thanks for trying! |
@NickStallman Hey, your issue with zbacktrace is that you are missing debug symbols for your PHP installation. It seems to us that your interned strings table is somehow broken. |
Thanks! I thought apport-retrace would fetch all the debug symbols but it looks like an extra step was needed. This backtrace looks quite normal to me, and it's hitting the issue in the common Yoast SEO plugin.
Or a different site, with the error occurring in a different plugin.
If it's the interned strings causing the issue, that would explain why, once it started to occur, it would keep occurring non-stop. The interned strings would never be repaired, so every request would hit the same issue. I can confirm that since disabling ddtrace and ddappsec on Friday there have been no further crashes, so something is going on here. I may try getting a few more crash dumps out of hours now that I can get good backtraces. I'll email a few dumps through to support. |
Ok now that I can get good backtraces I'm cautiously re-enabling it to see if I can get more data. This backtrace looks different to the above backtrace but mentions ddtrace specifically.
And another backtrace. This one has no zbacktrace at all.
I collected these with:
There appear to be no crashes across any of our servers if ddappsec and ddtrace aren't loaded, or with this configuration.
This is repeatable across many servers running Ubuntu 22.04.2 with the Ondrej PHP PPA. |
Thanks for the backtraces. Unfortunately the first backtrace of your last reply indicates that we have to deal with a memory corruption - and at that a corruption only happening after a few minutes. These are generally not debuggable with core dumps / backtraces. There are a few possibilities:
Just to make sure though, if you comment the |
I also wonder, is the issue related to opcache? I.e., if you disable it, do you still get crashes? |
Ah, I shouldn't have any problem compiling a copy of PHP with ASAN. The only tricky bit is deciding whether to do it locally or on a scratch VM, where I might not be able to reproduce the issue, or on a production server. I did just realise that I might not be able to rule out ddappsec after all: I had it disabled in php.ini but may still have had it enabled in each FPM pool's config. Could this issue be related to either the opcache or the interned strings cache being full with ddappsec enabled, by any chance? It is very likely some interaction between ddtrace/ddappsec and opcache. I ran with just opcache and no Datadog extensions over the weekend for safety, and it's rock solid on its own. I'm going to run with ddtrace on, ddappsec off, and opcache on for an extended period today to verify whether that is indeed stable. |
I have no clue why I keep getting different stack traces but this one looks quite interesting.
|
This last one again, is definitely With all these traces you've been showing me, I'm thinking we might be missing some critical cleanup handling after bailouts. Let me verify. |
Oh well this isn't good, I just noticed I also had a PHP 8.1 crash dump file this morning.
This is the same server as the previous backtraces, same ddtrace/ddappsec install and a very similar configuration.
This one does nicely highlight a fatal out of memory error at line 31. |
If my hypothesis about the issue is right, PHP 8.2 actually should not be affected by the other crashes. I'll keep you updated. This last backtrace is a bug in PHP itself though - it's not related to ddtrace at all. I've reported and fixed it: php/php-src#11189 |
A self-contained reproducer exhibiting the problem on PHP 8.1 and older:
A lot of different operations will exhibit this behavior: most importantly, a fatal error (bailout) needs to happen while a class is being autoloaded. |
We fix this issue by ensuring that the opline is always connected to its proper execute_data and popping newer entries if they aren't matching. Signed-off-by: Bob Weinand <bob.weinand@datadoghq.com>
Awesome! That sounds like you are on to it. :) I did run that test case in our PHP 7.4 environment and it didn't seem to do anything unusual by the way. No crash. Could it be a crash inside autoload AND shutdown functions? Quite a few of the crashes were shutdown function related. |
If you want to, you could try with the artifact from CI: https://output.circle-artifacts.com/output/job/850ba405-6510-419f-9fde-b5acdb9ad726/artifacts/0/datadog-setup.php This reproducer does not crash (variations of it do). But the order of operations is very weird. (executing part of the x hook after it leaves the autoloader)
Yes. Well, the autoloader corrupts the state, and the shutdown functions will read the corrupted state and explode. |
Thanks, I've deployed the latest build to one of my servers. Your test script has gone from:
to just
I'll report back with how we go today. |
Hmm still similar symptoms. This seems vaguely repeatable, I've gotten this exact same backtrace 3 times with 3 different php-fpm restarts in the same dd-trace autoload.php file. ddtrace = on, appsec = on produced this.
|
If you can provide the core dump for that trace, it might be helpful? Otherwise, as mentioned before, you might try:
|
Thanks @bwoebi, I've sent this particular crash dump via the ticket. After running v0.86.3+a0a6ef9ae4e73b10ac10d483cb6ea7df45b2aa36 for a few days, I definitely think that's one bug gone, with one bug remaining.
So I'm happy to say that configuration is stable. But turning appsec.enabled to On brings back crashes, and at least this time they are all the same. Let me see what I can do about an ASAN build on the same server. It might take a few days, and I'll probably have to test the build out of hours since it'll likely involve some disruption. |
Hey @NickStallman, so we just released 0.87.0, which includes all the fixes, including a new appsec release. I don't know whether this changes anything, but appsec has reworked a lot of the shutdown sequence on their side. Could you please try it out with their new version? At least from the crash dump investigation I have absolutely no idea where the issue could come from. |
Thanks, I'll do some testing this weekend when it's a little quieter, and fingers crossed the other issue also vanishes. |
I've been running 0.87.0 over the weekend and have some results. I did get 4,000 SIGSEGVs in the last 2 days, so there's still a bug somewhere. For perspective, those 4,000 segfaults were out of 12.8 million requests and 100,000 appsec suspicious traces, so quite a small portion of our traffic is crashing. I think we can leave this ticket closed, and if I find any further information about this remaining issue I'll open a separate ticket. |
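To put those numbers in proportion, here is a minimal sketch of the arithmetic. It uses only the figures quoted in the comment above; the variable and function names are illustrative, not from any tool used in this thread.

```python
# Figures quoted in the comment above (two days of traffic on 0.87.0).
segfaults = 4_000
requests = 12_800_000


def crash_rate(crashes: int, total: int) -> float:
    """Return the fraction of requests that ended in a crash."""
    return crashes / total


rate = crash_rate(segfaults, requests)
requests_per_crash = requests // segfaults

print(f"{rate:.5%} of requests crashed")
print(f"roughly one crash per {requests_per_crash:,} requests")
```

That works out to roughly one crash in every 3,200 requests: small as a share of traffic, but still frequent on a busy server.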
@NickStallman Do you have stacktraces for these? Is it still the same stacktrace we had for appsec starting with zend_mm_alloc_small? By the way, something much less invasive you could try out: running fpm with I would definitely love to fix all these bugs, even if it's fataling anyway. |
Would rr make some very very big recordings? We deal with 40+ hits/second on a single server so if it records everything that wouldn't be good. I do have a new observation, I got an alert that a specific account was crashing repeatedly and this account gets very low traffic.
The matching web server log was all a single IP, detected by appsec as a security scanner. (useragent trimmed)
You can see how it keeps flip flopping between 404 and 503 errors. There was absolutely no other traffic of any kind at all. Unfortunately I could not get a usable backtrace as this occurred on a different server to the one I have all the debugging set up on. I have managed to get this backtrace a couple of times on the server set up for debugging which may be similar. It was again caused by a security scanner but the logs aren't quite as clear cut as the above example.
This is a crash very early in the WP boot sequence and each time I've seen a backtrace that looks like this it is always on this line:
|
Likely it's always at that line, because this is the first time during your script execution that a second allocation of size 8 is done within your application. (This information I could extract from an earlier core dump.) Which seems to be the corrupted pointer. The question is just why. |
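One way to read the comment above is through a toy free-list model. This is a sketch under loud assumptions: `ToyBin` is a made-up Python class and does not reproduce the Zend memory manager. It only shows why a stray write into a freed slot can stay invisible until the second allocation from the same size class.

```python
class ToyBin:
    """Toy free list for one size class (an assumption, NOT Zend's allocator)."""

    def __init__(self, next_pointers, head):
        # Maps the address of each free slot to the address of the next free slot.
        self.next_pointers = dict(next_pointers)
        self.head = head

    def alloc(self):
        slot = self.head
        if slot is None:
            raise MemoryError("bin is empty")
        if slot not in self.next_pointers:
            # Stand-in for dereferencing a garbage pointer (a SIGSEGV in C).
            raise RuntimeError(f"invalid free-list pointer {slot:#x}")
        # Pop the slot and follow its stored "next" pointer.
        self.head = self.next_pointers.pop(slot)
        return slot


def demo():
    # The "next" pointer stored in the first free slot was overwritten
    # with garbage while the slot was free.
    corrupted = ToyBin({0x1000: 0xDEADBEEF, 0x1008: None}, head=0x1000)
    first = corrupted.alloc()  # succeeds: the head itself is still valid
    try:
        corrupted.alloc()  # fails: the head is now the garbage pointer
    except RuntimeError as exc:
        return first, str(exc)
    return first, None
```

In this model the first allocation hands out a valid slot and silently installs the corrupted pointer as the new head, so the failure only surfaces on the next allocation of that size, wherever in the script that happens to be.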
Bug description
We've had a bunch of SIGSEGV errors which have been getting worse over the last few days.
Yesterday it started happening within minutes or seconds of a PHP-FPM 7.4.33 restart, and once it started then basically all PHP requests to that PHP-FPM pool failed.
PHP version
Tracer version
0.86.3
Installed extensions
OS info
Diagnostics and configuration
I have not been able to directly pin it to ddtrace or ddappsec, however after disabling both the issue has immediately vanished.
So that's about as close to a smoking gun as I seem to be able to get, which is unfortunate as ddappsec is doing a fantastic job.
I've managed to get a coredump from a crash, and it appears to be related to shutdown functions.
ddtrace and ddappsec have been installed for months now, but this issue has only popped up since the latest version was deployed. I have no explanation as to why it's slowly gotten worse over time however.
It's occurring across many different codebases (mostly WordPress based, but totally different sites with very little in common).
It's also occurring across multiple identically configured servers.
No other changes have been made to these servers for months, beyond keeping ddtrace and ddappsec up to date.