Icinga2 crashes when sending notifications #8186
Comments
Hello @xschlef and thank you for reporting! Please enable core dumps and ask for an upload link once the bug occurs again. Best, |
It started happening again, but icinga did not produce any coredumps. Can I privately send you a stack trace of the crashes? Maybe that helps. |
Do you have any update on this? The issue seems to be recurring, at least once every two weeks. |
There aren't any core dumps and the stack traces look like this:
Cryptic memory addresses, not even line numbers. Unfortunately I can't do much. |
I uploaded at least 3 crashdumps to the nextcloud link you provided. They should contain at least some info.
|
No, you've uploaded crash reports / stack traces, but no coredumps. |
We've revisited all settings and we are still not getting any core dumps. The only thing related to our latest crash loop was a core dump of one of our checks. |
We had a few more crashes, but still no coredumps. We fixed the source of the config reloads. So I will report back in 1-2 months if the crashes disappear. Of course this is no real fix, but could explain why only we experienced this problem. |
The crash keeps happening, but only about once a month. We are still not getting any coredumps as the process seems to catch the error (nothing in kern.log / dmesg) and then terminates itself with status 134. My gdb skills are not the best:
Do you have any hints for me so I can generate a useful backtrace? |
Looks like enough of a hint to me, if there's a 0 value in icinga2/lib/notification/notificationcomponent.cpp Lines 183 to 202 in 2cb995e
This also gets written to the state file and restored on startup, so this would also explain the behavior you described in the initial message that it disappeared when removing the notification. icinga2/lib/icinga/notification.ti Lines 80 to 82 in 2cb995e
@xschlef Can you currently reproduce the crash on startup? If so, can you upload the file |
It crashed again. Here is the output of the statefile for every object with stashed_notifications set:
|
Can you double-check that you did not accidentally truncate these lines? They end in the middle of a JSON object, and the number at the beginning should be the length of the object, yet these lines are shorter than that. |
You are right, my format and search script used a wrong replacement technique. Here is the correct output:
|
Looks like you parsed the JSON in Python and then used
|
Can you supply me with a new nextcloud upload link? This dump contains a large number of hosts in our infra. |
I'm a bit surprised that this matches more than these three objects. Can you verify that my sed command actually worked? It should put each object on its own line. Anyways, you can upload it here: https://nextcloud.icinga.com/index.php/s/XX7883yQsPEWynJ |
Thanks for uploading. I just took a look at them: most are empty lists, which should be fine, and the remaining ones all look fine with one intact stashed notification. So it doesn't look like the crash comes from the state file, which probably means back to the drawing board (or looking at the code again in this case). By the way, have you upgraded Icinga 2 since the original report? I don't think this was fixed, but I prefer looking at the most recent version if I know the problem still exists there. |
we are running since 2021-02-19 |
I appear to have the same problem. But cannot figure out how to get around it. Could someone at least tell me how to clear the notifications so that I can get my environment working again? Here is the crash report.
|
more debugging, no core dumps
but none have anything in stashed_notifications that i can see.
matched nothing. Can I delete the state file? is there any way to clear the notifications so that the issue can be bypassed? |
If your system uses systemd, you can also have a look at I'm not sure if deleting the state file would fix something. I suspected it as some null object could possibly have been restored from there, but all the snippets I've seen from your state files look sane. I mean you could delete it but as the name suggests, you will lose some state (notably downtime, notification, acknowledgement state, possibly more). |
After checking via
It also appears that the application does not exit with
|
134 means exited due to SIGABRT, which - by default - also dumps core. However, I'm not sure if and how our crash handler might interfere with this. |
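The arithmetic behind that reading can be sketched in a few lines of Python. This is only an illustration of the common shell convention that a status above 128 encodes 128 plus the signal number; the helper name `describe_exit_status` is made up for this example and is not part of Icinga 2.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Explain a shell-style exit status (values above 128 mean 128 + signal number)."""
    if status > 128:
        signum = status - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # The number does not map to a signal known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited normally with code {status}"


print(describe_exit_status(134))
```

On Linux, 134 - 128 = 6, which is the number of SIGABRT.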
I tested the best I could and apport will generate .crash / coredumps (in /var/crash)
But it seems to refuse to do it for icinga2.
max core is set
and I could not set the dump format
instead I left it at the default
|
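For readers checking the same settings, here is a minimal Python sketch that reads the core size limit and the kernel's core pattern. The function name `core_dump_settings` is hypothetical, and the values it reports apply to the Python process itself, not to an already running icinga2 daemon (whose limits are listed in `/proc/<pid>/limits`).

```python
import resource
from pathlib import Path


def core_dump_settings() -> dict:
    """Collect the settings that commonly decide whether a core file is written."""
    # Soft and hard limit on core file size for *this* process.
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    try:
        pattern = Path("/proc/sys/kernel/core_pattern").read_text().strip()
    except OSError:
        # Not on Linux, or /proc is not readable here.
        pattern = None
    return {
        "soft": soft,
        "hard": hard,
        "core_pattern": pattern,
        # A leading "|" means the kernel pipes cores to a helper program
        # instead of writing a file directly.
        "piped": bool(pattern) and pattern.startswith("|"),
    }


print(core_dump_settings())
```

A soft limit of 0 disables core files for the process, and a piped pattern hands the decision to the helper program, which may be why a helper such as apport can produce dumps for some programs and not others.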
Deleting the icinga2.state file on the secondary node (the one that kept crashing) has resolved the problem. Both service instances are now running and not crashing. Not closing the issue, since the core problem has not yet been identified and resolved. |
Do you still have that state file so we can have a look? If yes, I’ll provide an upload link. |
Sadly I deleted it. I didn't even think to try and preserve it. And yes, I understand that debugging the issue without it is impossible. |
I tried to enable core dumps following the instructions at https://icinga.com/docs/icinga-2/latest/doc/21-development/#core-dump but nothing is written to /var/lib/cores |
I don't think there is need to further debug this as I know the cause of the crash. There just was no time to fix it so far unfortunately. |
ok, and how do I fix the crashing for now? |
all notifications that were not sent will be deleted due to this |
That's one option as this clears all the notification stuff from the state file. Another one would be to manually patch the state file and remove just the stashed notifications, but you'd have to be careful when doing this as the JSON objects in the state file are prefixed with their length. |
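The length prefix is what makes manual patching error-prone: after changing an object, its prefix has to be recomputed. The following Python sketch shows the idea under stated assumptions. It assumes a netstring-like framing of `<length>:<JSON>,` per record, and the sample record layout (`name`, `type`, `update`) is illustrative only. All function names are made up for this example; only the `stashed_notifications` key comes from this thread. Work on a copy of the state file, never on the live one.

```python
import json


def read_netstrings(data: bytes):
    """Yield each payload from a buffer of `<length>:<payload>,` records."""
    pos = 0
    while pos < len(data):
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        end = start + length
        if data[end:end + 1] != b",":
            raise ValueError(f"record at byte {pos} is truncated or has a wrong length prefix")
        yield data[start:end]
        pos = end + 1


def write_netstring(payload: bytes) -> bytes:
    """Frame a payload with its byte length, as assumed above."""
    return str(len(payload)).encode() + b":" + payload + b","


def clear_stashed(obj) -> None:
    """Recursively replace every "stashed_notifications" value with an empty list."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "stashed_notifications":
                obj[key] = []
            else:
                clear_stashed(value)
    elif isinstance(obj, list):
        for item in obj:
            clear_stashed(item)


def rewrite_state(data: bytes) -> bytes:
    """Clear stashed notifications in every record and recompute each length prefix."""
    out = []
    for payload in read_netstrings(data):
        obj = json.loads(payload)
        clear_stashed(obj)
        out.append(write_netstring(json.dumps(obj, separators=(",", ":")).encode()))
    return b"".join(out)


# Hypothetical sample record, not taken from a real state file.
sample_obj = {"name": "host1!ping4!mail", "type": "Notification",
              "update": {"stashed_notifications": [{"type": 32}]}}
sample = write_netstring(json.dumps(sample_obj).encode())
print(rewrite_state(sample))
```

Because `read_netstrings` checks for the trailing comma, it also catches the kind of truncated record discussed earlier in this thread, where the line was shorter than its prefix claimed.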
the stashed notifications seem to be empty:
most configuration is generated via director in our setup, do I need to remove the Notification and apply rules for Notifications? Or do i need to remove more? |
That probably doesn't search for what you think it does. You either need |
Ah found them: |
Thanks for the support. The issue is fixed on our site. |
So, this has with great success brought down my master server after upgrade to v2.13. Worst part was that during staging (and testing) the error did not manifest but when I deployed it in the production everything went to hell (1500 host, 8000 services). I realize resources are scarce and time limited (and I very much appreciate that Icinga is an OSS), but this could seriously ruin somebody's day. Please consider upping the priority on this one. |
not to confuse the state file deserializator with e.g. `"type":32` on startup. That would unexpectedly restore null (not `{"type":32}`) as there's no type "32". refs #8186
i.e. the confusion of the state file deserializator with e.g. `"type":32` on startup. That would unexpectedly restore (the now ignored) null (not `{"type":32}`) as there's no type "32". refs #8186
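The failure mode these commit messages describe can be illustrated with a toy model in Python. This is not Icinga 2's code: `KNOWN_TYPES` and `naive_restore` are hypothetical, and the sketch only assumes a deserializer that treats any dictionary carrying a `"type"` key as a typed object.

```python
# Hypothetical registry of object type names the toy deserializer knows about.
KNOWN_TYPES = {"Host", "Service", "Notification"}


def naive_restore(value):
    """Restore a value, treating any dict with a "type" key as a typed object.

    A dict whose "type" is not a registered type name is restored as None,
    which is the surprising behavior this sketch is meant to show.
    """
    if isinstance(value, list):
        return [naive_restore(item) for item in value]
    if isinstance(value, dict):
        if "type" in value and value["type"] not in KNOWN_TYPES:
            return None
        return {key: naive_restore(item) for key, item in value.items()}
    return value


restored = naive_restore({"stashed_notifications": [{"type": 32}]})
print(restored)
```

In this toy model the plain dictionary `{"type": 32}` comes back as `None` because 32 is not a registered type name, so any consumer that assumes every list entry is a dictionary would fail on it.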
Someone closed this issue just now, but due to a missing feature in the GitHub API and the high amount of comments here I can't figure out whether this issue was closed due to a PR merge. Please check by yourself whether this issue is on the correct milestone. |
Closed this issue because #9123 has been merged. |
Describe the bug
Yesterday, after a config reload, our icinga2 master kept crashing with the same stack trace.
At first it crashed after 20-30 seconds; after a while it crashed directly after validating the configuration. This went on for an hour.
After debugging a little, we figured that it is related to notifications. We disabled all notifications and icinga2 was able to start again without crashing. Then we were able to reactivate all notifications, as it seems that there was a buggy notification queued that caused the crash. Our theory is, that disabling all notifications cleared the buggy icinga2 state.
To Reproduce
We have no idea what caused this bug, except the fact that it seems to be related to notifications.
Expected behavior
No crashes.
Your Environment
Include as many relevant details about the environment you experienced the problem in
Application version: r2.12.0-1
System information:
Platform: Debian GNU/Linux
Platform version: 9 (stretch)
Kernel: Linux
Kernel version: 4.9.0-13-amd64
Architecture: x86_64
Disabled features: compatlog debuglog elasticsearch gelf graphite icingadb livestatus opentsdb perfdata statusdata
Enabled features: api checker command ido-mysql influxdb mainlog notification syslog
Icinga Web 2: 2.8.2
Config validation:
Additional context
We are running a single master node without HA.