Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Icinga2 crashes when sending notifications #8186

Closed
xschlef opened this issue Aug 21, 2020 · 45 comments · Fixed by #9123
Closed

Icinga2 crashes when sending notifications #8186

xschlef opened this issue Aug 21, 2020 · 45 comments · Fixed by #9123
Assignees
Labels
area/notifications Notification events bug Something isn't working core/crash Shouldn't happen, requires attention
Milestone

Comments

@xschlef
Copy link

xschlef commented Aug 21, 2020

Describe the bug

Yesterday, after a config reload, our icinga2 master kept crashing with the same stack strace.
At first it crashed after 20-30 Seconds, after a while directly after validating the configuration. This went on for an hour.

        (0) libc.so.6: gsignal (+0xcf) [0x7f0597314fff]
        (1) libc.so.6: abort (+0x16a) [0x7f059731642a]
        (2) libc.so.6: <unknown function> (+0x2be67) [0x7f059730de67]
        (3) libc.so.6: <unknown function> (+0x2bf12) [0x7f059730df12]
        (4) icinga2: icinga::NotificationComponent::NotificationTimerHandler() (+0x159b) [0x561d3dec965b]
        (5) icinga2: <unknown function> (+0x338271) [0x561d3db5b271]
        (6) icinga2: icinga::Timer::Call() (+0x2d) [0x561d3daf94bd]
        (7) icinga2: <unknown function> (+0x341e0d) [0x561d3db64e0d]
        (8) icinga2: boost::asio::detail::executor_op<boost::asio::detail::work_dispatcher<bool icinga::ThreadPool::Post<std::function<void ()> >(std::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, std::allocator<void>*, boost::system::error_code const&, unsigned long) (+0xe3) [0x561d3e0018b3]
        (9) icinga2: <unknown function> (+0x27d5f6) [0x561d3daa05f6]
        (10) icinga2: <unknown function> (+0x7dbae5) [0x561d3dffeae5]
        (11) icinga2: boost_asio_detail_posix_thread_function (+0xf) [0x561d3da93aff]
        (12) libpthread.so.0: <unknown function> (+0x74a4) [0x7f0598a8d4a4]
        (13) libc.so.6: clone (+0x3f) [0x7f05973cad0f]

After debugging a little, we figured that it is related to notifications. We disabled all notifications and icinga2 was able to start again without crashing. Then we were able to reactivate all notifications, as it seems that there was a buggy notification queued that caused the crash. Our theory is, that disabling all notifications cleared the buggy icinga2 state.

To Reproduce

We have no idea what caused this bug, except the fact that it seems to be related to notifications.

Expected behavior

No crashes.

Your Environment

Include as many relevant details about the environment you experienced the problem in

Application version: r2.12.0-1

System information:
Platform: Debian GNU/Linux
Platform version: 9 (stretch)
Kernel: Linux
Kernel version: 4.9.0-13-amd64
Architecture: x86_64

Disabled features: compatlog debuglog elasticsearch gelf graphite icingadb livestatus opentsdb perfdata statusdata
Enabled features: api checker command ido-mysql influxdb mainlog notification syslog

Icinga Web 2: 2.8.2

Config validation:

[2020-08-21 08:18:46 +0200] information/cli: Icinga application loader (version: r2.12.0-1)
[2020-08-21 08:18:46 +0200] information/cli: Loading configuration file(s).
[2020-08-21 08:18:46 +0200] information/ConfigItem: Committing config item(s).
[2020-08-21 08:18:46 +0200] information/ApiListener: My API identity: monty1.uni-paderborn.de
[2020-08-21 08:18:53 +0200] warning/ApplyRule: Apply rule 'notify_host_imt_srv_mail' (in /etc/icinga2/zones.d/master/notifications.conf: 369:1-369:53) for type 'Notification' does not match anywhere!
[2020-08-21 08:18:53 +0200] warning/ApplyRule: Apply rule 'notify_service_imt_srv_mail' (in /etc/icinga2/zones.d/master/notifications.conf: 378:1-378:59) for type 'Notification' does not match anywhere!
[2020-08-21 08:18:53 +0200] warning/ApplyRule: Apply rule 'mysql_backup_check_exists' (in /etc/icinga2/zones.d/master/services/mysql_cluster.conf: 19:1-19:41) for type 'Service' does not match anywhere!
[2020-08-21 08:18:53 +0200] warning/ApplyRule: Apply rule 'check_http_content' (in /etc/icinga2/zones.d/master/services/solr.conf: 1:0-1:33) for type 'Service' does not match anywhere!
[2020-08-21 08:18:53 +0200] warning/ApplyRule: Apply rule 'check_ipmi' (in /etc/icinga2/zones.d/master/services/sun.conf: 1:0-1:25) for type 'Service' does not match anywhere!
[2020-08-21 08:18:53 +0200] warning/ApplyRule: Apply rule 'check_windows-failover_sp' (in /etc/icinga2/zones.d/master/services/windows.conf: 13:1-13:41) for type 'Service' does not match anywhere!
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 1 SyslogLogger.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 904 Hosts.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 455 Downtimes.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 7 NotificationCommands.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 253 Comments.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 23810 Notifications.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 140 HostGroups.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 1 EventCommand.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 6774 Dependencies.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 463 Zones.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 463 Endpoints.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 498 ApiUsers.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 176 CheckCommands.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 7 TimePeriods.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 10 UserGroups.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 31 Users.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 10515 Services.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 42 ServiceGroups.
[2020-08-21 08:18:53 +0200] information/ConfigItem: Instantiated 181 ScheduledDowntimes.
[2020-08-21 08:18:53 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2020-08-21 08:18:53 +0200] information/cli: Finished validating the configuration file(s).

Additional context

We are running a single master node without HA.

@Al2Klimov
Copy link
Member

Hello @xschlef and thank you for reporting!

Please enable core dumps and ask for an upload link once the bug occurs again.

Best,
AK

@Al2Klimov Al2Klimov added the needs feedback We'll only proceed once we hear from you again label Aug 25, 2020
@xschlef
Copy link
Author

xschlef commented Sep 4, 2020

It started happening again, but icinga did not produce any coredumps.

Can I privately send you a stack trace of the crashes? Maybe that helps.

@Al2Klimov
Copy link
Member

@xschlef
Copy link
Author

xschlef commented Oct 6, 2020

Do you have any update on this? The issue seems to be recurring, at least once in two weeks.

@Al2Klimov
Copy link
Member

There aren't any core dumps and the stack traces look like this:

    (4) icinga2: icinga::NotificationComponent::NotificationTimerHandler() (+0x159b) [0x561d3dec965b]

Cryptic memory addresses, not even line numbers. Unfortunately I can't do much.

@xschlef
Copy link
Author

xschlef commented Oct 6, 2020

I uploaded at least 3 crashdumps to the nextcloud link you provided. They should contain at least some info.

... #9 icinga::NotificationComponent::NotificationTimerHandler() (this=0x7f1d98000c80) at ./lib/notification/notificationcomponent.cpp:194 unstashedNotification = {px = 0x0} __for_range = @0x7f1da64222f8: {px = 0x7f1d6005c370} __for_begin = <optimized out> unstashedNotifications = {px = 0x7f1d6005c370} olock = {m_Object = 0x7f1d6005c370, m_Locked = true} notificationName = {m_Data = <incomplete type>} checkable = {px = 0x7f1d9474f0c0} reachable = true __for_range = <unknown type in /usr/lib/debug/.build-id/bf/c4c3ceb9d9b20ab33231fdac089a7d850d34b8.debug, CU 0x16c4f87, DIE 0x1749d24> now = 1601188626.9996431 myEndpoint = {px = 0x7f1d9c071ca0} ...

@Al2Klimov
Copy link
Member

No, you've uploaded crash reports / stack traces, but no coredumps.

@xschlef
Copy link
Author

xschlef commented Oct 15, 2020

We've revisited all settings and we are still not getting any core dumps. The only thing related to our latest crash loop was a core dump of one of our checks.

@xschlef
Copy link
Author

xschlef commented Dec 17, 2020

We had a few more crashes, but still no coredumps.
We maybe found a source, which may lie in our configmanagement:
We deploy every hour through cron and remove the folder /etc/icinga2/conf.d, as we don't want the configs there.
A recent icinga2 package update then failed, because the debian package relies on the folder being present.
Our config management tried to fix this every hour and failed.
So maybe this is a source for the crash, as it always happened after a config reload.

We fixed the source of the config reloads. So I will report back in 1-2 months if the crashes disappear. Of course this is no real fix, but could explain why only we experienced this problem.

@xschlef
Copy link
Author

xschlef commented Feb 19, 2021

The crash keeps happening, but only about once a month. We are still not getting any coredumps as the process seems to catch the error (nothing in kern.log / dmesg) and then terminates itself with status 134.

My gdb skills are not the best:

gdb -p $(pidof icinga2 | cut -d ' ' -f3)
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 25387
Reading symbols from /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2...Reading symbols from /usr/lib/debug/.build-id/97/b6eedd15b29e8ebdd0cd2e92db56e3aea70ad7.debug...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_context.so.1.67.0...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_coroutine.so.1.67.0...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_date_time.so.1.67.0...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_filesystem.so.1.67.0...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.67.0...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_system.so.1.67.0...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_program_options.so.1.67.0...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_regex.so.1.67.0...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_chrono.so.1.67.0...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libboost_atomic.so.1.67.0...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libssl.so.1.1...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libsystemd.so.0...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libedit.so.2...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libtinfo.so.6...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libstdc++.so.6...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...(no debugging symbols found)...done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/librt.so.1...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libicudata.so.63...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libicui18n.so.63...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libicuuc.so.63...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/liblzma.so.5...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/liblz4.so.1...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libgcrypt.so.20...(no debugging symbols found)...done.
Reading symbols from /usr/lib/x86_64-linux-gnu/libbsd.so.0...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libgpg-error.so.0...(no debugging symbols found)...done.
0x00007f5f0611d720 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) c
Continuing.

Program received signal SIGUSR2, User defined signal 2.
0x00007f5f0611d720 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) c
Continuing.
[New Thread 0x7f5f03ec5700 (LWP 25541)]
[New Thread 0x7f5f03f06700 (LWP 25542)]
[New Thread 0x7f5f03f47700 (LWP 25543)]
[New Thread 0x7f5f03f88700 (LWP 25544)]
[New Thread 0x7f5f03fc9700 (LWP 25545)]
[New Thread 0x7f5f03e84700 (LWP 25546)]
[New Thread 0x7f5f03e43700 (LWP 25547)]
[New Thread 0x7f5f03e02700 (LWP 25548)]
[Thread 0x7f5f03e84700 (LWP 25546) exited]
[Thread 0x7f5f03e02700 (LWP 25548) exited]
[Thread 0x7f5f03fc9700 (LWP 25545) exited]
[Thread 0x7f5f03f88700 (LWP 25544) exited]
[Thread 0x7f5f03f47700 (LWP 25543) exited]
[Thread 0x7f5f03ec5700 (LWP 25541) exited]
[Thread 0x7f5f03e43700 (LWP 25547) exited]
[Thread 0x7f5f03f06700 (LWP 25542) exited]
[Inferior 1 (process 25387) exited with code 0206]
(gdb) bt
No stack.
(gdb) thread apply all bt full
(gdb) bt
No stack.

Do you have any hints for me so I can generate a useful backtrace?

@julianbrost
Copy link
Contributor

julianbrost commented Feb 19, 2021

at ./lib/notification/notificationcomponent.cpp:194 unstashedNotification = {px = 0x0}

Looks like enough of a hint to me, if there's a 0 value in unstashedNotifications, it'll crash:

for (Dictionary::Ptr unstashedNotification : unstashedNotifications) {
try {
Log(LogNotice, "NotificationComponent")
<< "Attempting to send stashed notification '" << notificationName << "'.";
notification->BeginExecuteNotification(
(NotificationType)(int)unstashedNotification->Get("type"),
(CheckResult::Ptr)unstashedNotification->Get("cr"),
(bool)unstashedNotification->Get("force"),
(bool)unstashedNotification->Get("reminder"),
(String)unstashedNotification->Get("author"),
(String)unstashedNotification->Get("text")
);
} catch (const std::exception& ex) {
Log(LogWarning, "NotificationComponent")
<< "Exception occurred during notification for object '"
<< notificationName << "': " << DiagnosticInformation(ex, false);
}
}
}

This also gets written to the state file and restored on startup, so this would also explain the behavior you described in the initial message that it disappeared when removing the notification.

[state, no_user_view, no_user_modify] Array::Ptr stashed_notifications {
default {{{ return new Array(); }}}
};

@xschlef Can you currently reproduce the crash on startup? If so, can you upload the file /var/lib/icinga2/icinga2.state or look at the stashed_notifications values in there?

@xschlef
Copy link
Author

xschlef commented Mar 8, 2021

It crashed again. Here is the output of the statefile for every object with stashed_notifications set:

1042:{"name":"tula.uni-paderborn.de!btrfs quota vimp!notify_service_vimp_mail","type":"Notification","update":{"last_notification":1614936634.468515,"last_problem_notification":1614936634.468515,"next_notification":1615201715.136157,"no_more_notifications":true,"notification_number":2.0,"notified_problem_users":["pinger","sybille"],"stashed_notifications":[{"author":"","cr":{"active":true,"check_source":"tula.uni-paderborn.de","command":null,"execution_end":1615201655.938914,"execution_start":1615201655.93878,"exit_status":0.0,"output":"Remote Icinga instance 'tula.uni-paderborn.de' is not connected to 'monty1.uni-paderborn.de'","performance_data":null,"schedule_end":1615201655.938914,"schedule_start":1615201655.435223,"state":3.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":3.0,"state_type":1.0}
1046:{"name":"tula.uni-paderborn.de!btrfs quota vimp!notify_service_imt_srv_telegram","type":"Notification","update":{"last_notification":1614936634.466702,"last_problem_notification":1614936634.466702,"next_notification":1615201715.150247,"no_more_notifications":true,"notification_number":2.0,"notified_problem_users":["server-intern"],"stashed_notifications":[{"author":"","cr":{"active":true,"check_source":"tula.uni-paderborn.de","command":null,"execution_end":1615201655.938914,"execution_start":1615201655.93878,"exit_status":0.0,"output":"Remote Icinga instance 'tula.uni-paderborn.de' is not connected to 'monty1.uni-paderborn.de'","performance_data":null,"schedule_end":1615201655.938914,"schedule_start":1615201655.435223,"state":3.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":3.0,"state_type":1.0}
1042:{"name":"tula.uni-paderborn.de!btrfs quota vimp!notify_service_imt_srv_mail","type":"Notification","update":{"last_notification":1614936634.466776,"last_problem_notification":1614936634.466776,"next_notification":1615201715.041421,"no_more_notifications":true,"notification_number":2.0,"notified_problem_users":["server-intern"],"stashed_notifications":[{"author":"","cr":{"active":true,"check_source":"tula.uni-paderborn.de","command":null,"execution_end":1615201655.938914,"execution_start":1615201655.93878,"exit_status":0.0,"output":"Remote Icinga instance 'tula.uni-paderborn.de' is not connected to 'monty1.uni-paderborn.de'","performance_data":null,"schedule_end":1615201655.938914,"schedule_start":1615201655.435223,"state":3.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":3.0,"state_type":1.0}

@julianbrost
Copy link
Contributor

Can you double-check that you did not accidentally truncate these lines? They end in the middle of a JSON object and the number at the beginning should the length of the object and these lines are shorter than that.

@xschlef
Copy link
Author

xschlef commented Mar 8, 2021

You are right, my format and search script used a wrong replacement technique. Here is the correct output:

{'name': 'tula.uni-paderborn.de!btrfs quota vimp!notify_service_vimp_mail', 'type': 'Notification', 'update': {'last_notification': 1614936634.468515, 'last_problem_notification': 1614936634.468515, 'next_notification': 1615201715.136157, 'no_more_notifications': True, 'notification_number': 2.0, 'notified_problem_users': ['pinger', 'sybille'], 'stashed_notifications': [{'author': '', 'cr': {'active': True, 'check_source': 'tula.uni-paderborn.de', 'command': None, 'execution_end': 1615201655.938914, 'execution_start': 1615201655.93878, 'exit_status': 0.0, 'output': "Remote Icinga instance 'tula.uni-paderborn.de' is not connected to 'monty1.uni-paderborn.de'", 'performance_data': None, 'schedule_end': 1615201655.938914, 'schedule_start': 1615201655.435223, 'state': 3.0, 'ttl': 0.0, 'type': 'CheckResult', 'vars_after': {'attempt': 1.0, 'reachable': True, 'state': 3.0, 'state_type': 1.0}, 'vars_before': {'attempt': 1.0, 'reachable': True, 'state': 1.0, 'state_type': 1.0}}, 'force': False, 'reminder': False, 'text': '', 'type': 32.0}], 'suppressed_notifications': 0.0, 'type': 'Notification', 'version': 0.0}}
{'name': 'tula.uni-paderborn.de!btrfs quota vimp!notify_service_imt_srv_telegram', 'type': 'Notification', 'update': {'last_notification': 1614936634.466702, 'last_problem_notification': 1614936634.466702, 'next_notification': 1615201715.150247, 'no_more_notifications': True, 'notification_number': 2.0, 'notified_problem_users': ['server-intern'], 'stashed_notifications': [{'author': '', 'cr': {'active': True, 'check_source': 'tula.uni-paderborn.de', 'command': None, 'execution_end': 1615201655.938914, 'execution_start': 1615201655.93878, 'exit_status': 0.0, 'output': "Remote Icinga instance 'tula.uni-paderborn.de' is not connected to 'monty1.uni-paderborn.de'", 'performance_data': None, 'schedule_end': 1615201655.938914, 'schedule_start': 1615201655.435223, 'state': 3.0, 'ttl': 0.0, 'type': 'CheckResult', 'vars_after': {'attempt': 1.0, 'reachable': True, 'state': 3.0, 'state_type': 1.0}, 'vars_before': {'attempt': 1.0, 'reachable': True, 'state': 1.0, 'state_type': 1.0}}, 'force': False, 'reminder': False, 'text': '', 'type': 32.0}], 'suppressed_notifications': 0.0, 'type': 'Notification', 'version': 0.0}}
{'name': 'tula.uni-paderborn.de!btrfs quota vimp!notify_service_imt_srv_mail', 'type': 'Notification', 'update': {'last_notification': 1614936634.466776, 'last_problem_notification': 1614936634.466776, 'next_notification': 1615201715.041421, 'no_more_notifications': True, 'notification_number': 2.0, 'notified_problem_users': ['server-intern'], 'stashed_notifications': [{'author': '', 'cr': {'active': True, 'check_source': 'tula.uni-paderborn.de', 'command': None, 'execution_end': 1615201655.938914, 'execution_start': 1615201655.93878, 'exit_status': 0.0, 'output': "Remote Icinga instance 'tula.uni-paderborn.de' is not connected to 'monty1.uni-paderborn.de'", 'performance_data': None, 'schedule_end': 1615201655.938914, 'schedule_start': 1615201655.435223, 'state': 3.0, 'ttl': 0.0, 'type': 'CheckResult', 'vars_after': {'attempt': 1.0, 'reachable': True, 'state': 3.0, 'state_type': 1.0}, 'vars_before': {'attempt': 1.0, 'reachable': True, 'state': 1.0, 'state_type': 1.0}}, 'force': False, 'reminder': False, 'text': '', 'type': 32.0}], 'suppressed_notifications': 0.0, 'type': 'Notification', 'version': 0.0}}

@julianbrost
Copy link
Contributor

Looks like you parsed the JSON in Python and then used repr() to print it. As this could in theory depend on subtleties of parsing, can you please run the following on the state file and share the result:

sed 's/,\([0-9]*\):/,\n\1:/g' icinga2.state | grep stashed_notifications

@xschlef
Copy link
Author

xschlef commented Mar 8, 2021

Can you supply me with a new nextcloud upload link? This dump contains a large number of hosts in our infra.

@julianbrost
Copy link
Contributor

I'm a bit surprised that this matches more than these three objects. Can you verify that my sed command actually worked? It should put each object on its own line.

Anyways, you can upload it here: https://nextcloud.icinga.com/index.php/s/XX7883yQsPEWynJ

@julianbrost
Copy link
Contributor

Thanks for uploading, I just took a look at them, most are empty lists, this should be fine, and the remaining ones all look fine with one intact stashed notification. So doesn't look like it comes from the state file. So probably back to the drawing board (or looking at the code again in this case).

By the way, have you upgraded Icinga 2 since the original report? I don't think this was fixed, but I prefer looking at the most recent version if I know the problem still exists there.

@xschlef
Copy link
Author

xschlef commented Mar 30, 2021

we are running
icinga2 - The Icinga 2 network monitoring daemon (version: r2.12.3-1)

since 2021-02-19

@jgrammen-agilitypr
Copy link

I appear to have the same problem. But cannot figure out how to get around it.
I enabled core dumps following the instructions : https://icinga.com/docs/icinga-2/latest/doc/21-development/#core-dump
but I am not getting any dump in /var/lib/cores/

Could someone at least tell me how to clear the notifications so that I can get my environment working again?

Here is the crash report.

  Application version: r2.12.4-1

System information:
  Platform: Ubuntu
  Platform version: 20.04.2 LTS (Focal Fossa)
  Kernel: Linux
  Kernel version: 5.8.0-1038-aws
  Architecture: x86_64

Build information:
  Compiler: GNU 9.3.0
  Build host: runner-hh8q3bz2-project-298-concurrent-0
  OpenSSL version: OpenSSL 1.1.1f  31 Mar 2020

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
Stacktrace:

        (0) libc.so.6: gsignal (+0xcb) [0x7fbd3e30b18b]
        (1) libc.so.6: abort (+0x12b) [0x7fbd3e2ea859]
        (2) libc.so.6: <unknown function> (+0x25729) [0x7fbd3e2ea729]
        (3) libc.so.6: <unknown function> (+0x36f36) [0x7fbd3e2fbf36]
        (4) icinga2: <unknown function> (+0x38c839) [0x55ab975df839]
        (5) icinga2: icinga::NotificationComponent::NotificationTimerHandler() (+0x16ac) [0x55ab9766f47c]
        (6) icinga2: <unknown function> (+0x800fc9) [0x55ab97a53fc9]
        (7) icinga2: icinga::Timer::Call() (+0x31) [0x55ab97a7b6d1]
        (8) icinga2: <unknown function> (+0x813361) [0x55ab97a66361]
        (9) icinga2: boost::asio::detail::executor_op<boost::asio::detail::work_dispatcher<icinga::ThreadPool::Post<std::function<void ()> >(std::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, std::allocator<void>*, boost::system::error_code const&, unsigned long) (+0xc6) [0x55ab97a27046]
        (10) icinga2: <unknown function> (+0x8a7b15) [0x55ab97afab15]
        (11) icinga2: <unknown function> (+0x7c5432) [0x55ab97a18432]
        (12) icinga2: boost_asio_detail_posix_thread_function (+0x14) [0x55ab97aebb34]
        (13) libpthread.so.0: <unknown function> (+0x9609) [0x7fbd3dc69609]
        (14) libc.so.6: clone (+0x43) [0x7fbd3e3e7293]

***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://github.com/Icinga/icinga2 and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***

Failed to launch GDB: No such file or directory

@jgrammen-agilitypr
Copy link

more debugging, no core dumps
checked the state file with the grep provide above
sed 's/,\([0-9]*\):/,\n\1:/g' icinga2.state | grep stashed_notifications
it matches against a large number of hosts,

342:{"name":"hostnameredacted!sms-host-escalation","type":"Notification","update":{"last_notification":0.0,"last_problem_notification":0.0,"next_notification":0.0,"no_more_notifications":false,"notification_number":0.0,"notified_problem_users":[],"stashed_notifications":[],"suppressed_notifications":0.0,"type":"Notification","version":0.0}},

but none have anything in stashed_notifications that i can see.

sed 's/,\([0-9]*\):/,\n\1:/g' icinga2.state | grep 'stashed_notifications\"\:\[.+\]'

matched nothing.
State file is 12MB, although not exactly the same file size on both master hosts.

Can I delete the state file? is there any way to clear the notifications so that the issue can be bypassed?

@julianbrost
Copy link
Contributor

If your system uses systemd, you can also have a look at coredumpctl, maybe that collected some core dumps.

I'm not sure if deleting the state file would fix something. I suspected it as some null object could possibly have been restored from there, but all the snippets I've seen from your state files look sane. I mean you could delete it but as the name suggests, you will lose some state (notably downtime, notification, acknowledgement state, possibly more).

@jgrammen-agilitypr
Copy link

After checking via dpkg, and apt-cache policy and coredumpctl [via systemd-coredump] is not installed, as ubuntu 20.04 defaults to using apport

systemd-coredump:
  Installed: (none)
  Candidate: 245.4-4ubuntu3.7

It also appears that the application does not exit with SIGSEGV and thus does not trigger core dumps

534 Jul 14 15:49:10 host2 systemd[1]: Started Icinga host/service/network monitoring system.
535 Jul 14 15:51:06 host2 systemd[1]: icinga2.service: Main process exited, code=exited, status=134/n/a
536 Jul 14 15:51:06 host2 systemd[1]: icinga2.service: Failed with result 'exit-code'.
537 Jul 14 15:51:39 host2 systemd[1]: Starting Icinga host/service/network monitoring system...

@julianbrost
Copy link
Contributor

134 means exited due to SIGABRT, which - by default - also dumps core. However, I'm not sure if and how our crash handler might interfere with this.

@jgrammen-agilitypr
Copy link

I tested the best I could and apport will generate .crash / coredumps (in /var/crash)

root@ca04vlnms01:/var/crash# sleep 10 &
[1] 10653
root@ca04vlnms01:/var/crash# killall -SIGSEGV sleep
root@ca04vlnms01:/var/crash#
[1]+  Segmentation fault      (core dumped) sleep 10
root@ca04vlnms01:/var/crash# sleep 30 &
[1] 10657
root@ca04vlnms01:/var/crash# killall -SIGABRT sleep
root@ca04vlnms01:/var/crash#
[1]+  Aborted                 (core dumped) sleep 30

But it seems to refused to do it for icinga2.
the following is set

sysctl -w fs.suid_dumpable=2

max core is set

Max core file size        unlimited            unlimited            bytes

and I could not set the dump format

sysctl -w kernel.core_pattern=/var/lib/cores/core.%e.%p
journalctl -xe
Jul 15 07:48:09 ca04vlnms01.agilitypr.internal kernel: Unsafe core_pattern used with fs.suid_dumpable=2.
                                                       Pipe handler or fully qualified core dump path required.
                                                       Set kernel.core_pattern before fs.suid_dumpable

instead I left it at the default

kernel.core_pattern = |/usr/share/apport/apport %p %s %c %d %P %E

@jgrammen-agilitypr
Copy link

deleting the icinag2.state file on the secondary node (the one that kept crashing) has resolved the problem. both services instances are now running and not crashing. Not closing issue, since the core problem has not yet been identified and resolved

@lippserd
Copy link
Member

Do you still have that state file so we can have a look? If yes, I’ll provide an upload link.

@jgrammen-agilitypr
Copy link

jgrammen-agilitypr commented Jul 15, 2021

sadly I deleted it. I didn't even think to try and preserve it. And yes I understand that debugging the issue without it, is impossible

@arjanoosting
Copy link

i tried to enable the core dumps following the instruction on https://icinga.com/docs/icinga-2/latest/doc/21-development/#core-dump but nothing is written in /var/lib/cores

@julianbrost
Copy link
Contributor

I don't think there is need to further debug this as I know the cause of the crash. There just was no time to fix it so far unfortunately.

@arjanoosting
Copy link

ok, and how do I fix the crashing for now?

@xschlef
Copy link
Author

xschlef commented Sep 14, 2021

Sorry for the late reply, but a quick fix without deleting the state file is removing all configuration related to notifications and then restarting icinga2. after a successful restart you can restore the notification configuration and everything works fine again.

all notifications that were not sent will be deleted due to this

@julianbrost
Copy link
Contributor

That's one option as this clears all the notification stuff from the state file. Another one would be to manually patch the state file and remove just the stashed notifications, but you'd have to be careful when doing this as the JSON objects in the state file are prefixed with their length.

@arjanoosting
Copy link

the stashed notifications seem to be empy:

root@icm01:/var/lib/icinga2# sed 's/,\([0-9]*\):/,\n\1:/g' icinga2.state | grep stashed_notifications | wc -l
14768
root@icm01:/var/lib/icinga2# sed 's/,\([0-9]*\):/,\n\1:/g' icinga2.state | grep 'stashed_notifications"\:\[.+\]' | wc -l
0

most configuration is generated via director in our setup, do I need to remove the Notification and apply rules for Notifications? Or do i need to remove more?

@julianbrost
Copy link
Contributor

grep 'stashed_notifications"\:\[.+\]'

That probably doesn't search for what you think it does. You either need \+ instead of just + or use grep -E.

@arjanoosting
Copy link

Ah found them:
"stashed_notifications":[null]

@arjanoosting
Copy link

Thanks for the support. The issue is fixed on our site.
I found the relevant Notifications with '[null]' in the stashed_notification list and disabled the corresponding Notification profiles in Director. After deploying I could enable the feature notification again on the node.
Then iIenabled the notification profiles in Director and deployed them again.

@Al2Klimov Al2Klimov added core/crash Shouldn't happen, requires attention area/notifications Notification events labels Oct 19, 2021
@107142
Copy link

107142 commented Nov 18, 2021

So, this has with great success brought down my master server after upgrade to v2.13. Worst part was that during staging (and testing) the error did not manifest but when I deployed it in the production everything went to hell (1500 host, 8000 services).

I realize resources are scarce and time limited (and I very much appreciate that Icinga is an OSS), but this could seriously ruin somebody's day. Please consider upping the priority on this one.

Al2Klimov added a commit that referenced this issue Dec 13, 2021
not to confuse the state file deserializator with e.g. `"type":32` on startup.
That would unexpectedly restore null (not `{"type":32}`) as there's no type "32".

refs #8186
@Al2Klimov Al2Klimov self-assigned this Dec 13, 2021
Al2Klimov added a commit that referenced this issue Dec 13, 2021
not to confuse the state file deserializator with e.g. `"type":32` on startup.
That would unexpectedly restore null (not `{"type":32}`) as there's no type "32".

refs #8186
Al2Klimov added a commit that referenced this issue Jan 4, 2022
i.e. the confusion of the state file deserializator with e.g. `"type":32` on startup.
That would unexpectedly restore (the now ignored) null (not `{"type":32}`) as there's no type "32".

refs #8186
Al2Klimov added a commit that referenced this issue Jan 4, 2022
i.e. the confusion of the state file deserializator with e.g. `"type":32` on startup.
That would unexpectedly restore (the now ignored) null (not `{"type":32}`) as there's no type "32".

refs #8186
@icinga-probot
Copy link
Contributor

icinga-probot bot commented Jan 19, 2022

Someone closed this issue just now, but due to a missing feature in the GitHub API and the high amount of comments here I can't figure out whether this issue was closed due to a PR merge. Please check by yourself whether this issue is on the correct milestone.

@icinga-probot icinga-probot bot added this to the 2.14.0 milestone Jan 19, 2022
@icinga-probot
Copy link
Contributor

icinga-probot bot commented Jan 19, 2022

Closed this issue because #9123 has been merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/notifications Notification events bug Something isn't working core/crash Shouldn't happen, requires attention
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants