Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Icinga 2.11.x crashing every 30 minutes in JsonRpcConnection::HandleAndWriteHeartbeats() #7569

Closed
mattrose opened this issue Oct 9, 2019 · 25 comments
Assignees
Labels
area/distributed Distributed monitoring (master, satellites, clients) blocker Blocks a release or needs immediate attention bug Something isn't working duplicate This issue or pull request already exists

Comments

@mattrose
Copy link

mattrose commented Oct 9, 2019

Describe the bug

icinga 2.11.0-1 seems to be crashing every 30 minutes. Exiting with return code 134.

For example:

root@icinga-master.servers.fr:/var/log/icinga2/crash#
[1]+ Exit 134 nohup icinga2 daemon > /var/log/icinga-matt.log (wd: ~)
(wd now: /var/log/icinga2/crash)

To Reproduce

Unsure of reproduction steps. I upgraded from 2.10 to 2.11 on ubuntu, using the icinga2 package repository, and it started crashing every 30 minutes or so.

Expected behavior

Not to crash

Your Environment

Include as many relevant details about the environment you experienced the problem in

root@icinga-master.servers.fr:/var/log/icinga2/crash# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.11.0-1)

Copyright (c) 2012-2019 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Ubuntu
  Platform version: 18.04.1 LTS (Bionic Beaver)
  Kernel: Linux
  Kernel version: 4.15.0-43-generic
  Architecture: x86_64

Build information:
  Compiler: GNU 8.3.0
  Build host: runner-LTrJQZ9N-project-298-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
  • Operating System and version:
root@icinga-master.servers.fr:/var/log/icinga2/crash# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.1 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Enabled features (icinga2 feature list):
root@icinga-master.servers.fr:/var/log/icinga2/crash# icinga2 feature list
Disabled features: command compatlog elasticsearch gelf graphite influxdb livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker debuglog ido-pgsql mainlog notification
  • Config validation (icinga2 daemon -C):

With some private information replaced with 'xxx'

[2019-10-09 13:59:57 +0000] information/cli: Icinga application loader (version: r2.11.0-1)
[2019-10-09 13:59:57 +0000] information/cli: Loading configuration file(s).
[2019-10-09 13:59:57 +0000] warning/config: Ignoring directory '/var/lib/icinga2/api/zones/xxx' for unknown zone 'xxx'.
[2019-10-09 13:59:57 +0000] warning/config: Ignoring directory '/var/lib/icinga2/api/zones/xxx' for unknown zone 'xxx'.
[2019-10-09 13:59:57 +0000] warning/config: Ignoring directory '/var/lib/icinga2/api/zones/xxx' for unknown zone 'xxx'.
[2019-10-09 13:59:57 +0000] warning/config: Ignoring directory '/var/lib/icinga2/api/zones/jenkins.servers.fr.internal.xxx' for unknown zone 'jenkins.servers.fr.internal.xxx'.
[2019-10-09 13:59:57 +0000] warning/config: Ignoring directory '/var/lib/icinga2/api/zones/mail.xxx' for unknown zone 'mail.xxx'.
[2019-10-09 13:59:57 +0000] warning/config: Ignoring directory '/var/lib/icinga2/api/zones/xxx'.
[2019-10-09 13:59:57 +0000] warning/config: Ignoring directory '/var/lib/icinga2/api/zones/xxx'.
[2019-10-09 13:59:57 +0000] information/ConfigItem: Committing config item(s).
[2019-10-09 13:59:57 +0000] information/ApiListener: My API identity: icinga-master.servers.fr.internal.xxx
[2019-10-09 13:59:58 +0000] warning/ApplyRule: Apply rule 'health-check-parent-' (in /etc/icinga2/zones.d/global-templates/dependencies.conf: 18:1-18:81) for type 'Dependency' does not match anywhere!
[2019-10-09 13:59:58 +0000] warning/ApplyRule: Apply rule 'mail-icingaadmin' (in /etc/icinga2/conf.d/notifications.conf: 11:1-11:45) for type 'Notification' does not match anywhere!
[2019-10-09 13:59:58 +0000] warning/ApplyRule: Apply rule 'mail-icingaadmin' (in /etc/icinga2/conf.d/notifications.conf: 23:1-23:48) for type 'Notification' does not match anywhere!
[2019-10-09 13:59:58 +0000] warning/ApplyRule: Apply rule 'backup-downtime' (in /etc/icinga2/conf.d/downtimes.conf: 5:1-5:52) for type 'ScheduledDowntime' does not match anywhere!
[2019-10-09 13:59:58 +0000] warning/ApplyRule: Apply rule 'ping6' (in /etc/icinga2/zones.d/global-templates/services.conf: 27:1-27:29) for type 'Service' does not match anywhere!
[2019-10-09 13:59:58 +0000] warning/ApplyRule: Apply rule 'freeradius-' (in /etc/icinga2/zones.d/global-templates/services.conf: 296:1-296:80) for type 'Service' does not match anywhere!
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 2 FileLoggers.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 277 Dependencies.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 3 NotificationCommands.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 1 NotificationComponent.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 297 Notifications.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 1 IcingaApplication.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 9 HostGroups.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 43 Hosts.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 1 CheckerComponent.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 26 Zones.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 25 Endpoints.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 2 ApiUsers.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 1 UserGroup.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 1 ApiListener.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 241 CheckCommands.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 1 IdoPgsqlConnection.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 3 TimePeriods.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 3 Users.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 274 Services.
[2019-10-09 13:59:58 +0000] information/ConfigItem: Instantiated 3 ServiceGroups.
[2019-10-09 13:59:58 +0000] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2019-10-09 13:59:58 +0000] information/cli: Finished validating the configuration file(s).
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.

We run multiple icinga2 instances but much of the information in that output is considered private

Additional context

root@icinga-master.servers.fr:/var/log/icinga2/crash# cat report.1570628747.543626
Application version: r2.11.0-1

System information:
Platform: Ubuntu
Platform version: 18.04.1 LTS (Bionic Beaver)
Kernel: Linux
Kernel version: 4.15.0-43-generic
Architecture: x86_64

Build information:
Compiler: GNU 8.3.0
Build host: runner-LTrJQZ9N-project-298-concurrent-0

Application information:

General paths:
Config directory: /etc/icinga2
Data directory: /var/lib/icinga2
Log directory: /var/log/icinga2
Cache directory: /var/cache/icinga2
Spool directory: /var/spool/icinga2
Run directory: /run/icinga2

Old paths (deprecated):
Installation root: /usr
Sysconf directory: /etc
Run directory (base): /run
Local state directory: /var

Internal paths:
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
Stacktrace:

(0) libc.so.6: gsignal (+0xc7) [0x7f1c47a35e97]
(1) libc.so.6: abort (+0x141) [0x7f1c47a37801]
(2) libc.so.6: <unknown function> (+0x3039a) [0x7f1c47a2739a]
(3) libc.so.6: <unknown function> (+0x30412) [0x7f1c47a27412]
(4) icinga2: <unknown function> (+0x578603) [0x55aadd980603]
(5) icinga2: icinga::JsonRpcConnection::HandleAndWriteHeartbeats(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) (+0x960) [0x55aadda20270]
(6) icinga2: <unknown function> (+0x5e277d) [0x55aadd9ea77d]
(7) libboost_context.so.1.67.0: make_fcontext (+0x2f) [0x7f1c4509a65f]

  • This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
  • please submit a bug report at https://github.com/Icinga/icinga2 and include this stack trace as well as any other
  • information that might be useful in order to reproduce this problem.

Failed to launch GDB: No such file or directory




@mattrose
Copy link
Author

mattrose commented Oct 9, 2019

I'll try to get gdb and the debug symbols on there so that I can give you more information

@mattrose
Copy link
Author

mattrose commented Oct 9, 2019

nm. Debug symbol packages don't seem to be available in the icinga2 packages repo

@mattrose
Copy link
Author

mattrose commented Oct 9, 2019

Hmm. This is curious

This is pretty much the last few lines of the debug log

[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: Client for endpoint 'icinga-satellite.servers.ca.internal.networkradius.com' has requested heartbeat message but hasn't responded in time. Closing connection.
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: API client disconnected for identity 'icinga-satellite.servers.ca.internal.networkradius.com'
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: Client for endpoint xxx.client has requested heartbeat message but hasn't responded in time. Closing connection.
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: API client disconnected for identity xxx.client
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: Client for endpoint xxx.client has requested heartbeat message but hasn't responded in time. Closing connection.
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: API client disconnected for identity xxx.client
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: Client for endpoint xxx.client has requested heartbeat message but hasn't responded in time. Closing connection.
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: API client disconnected for identity xxx.client
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: Client for endpoint xxx.client has requested heartbeat message but hasn't responded in time. Closing connection.
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: API client disconnected for identity xxx.client
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: Client for endpoint xxx.client has requested heartbeat message but hasn't responded in time. Closing connection.
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: API client disconnected for identity xxx.client
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: Client for endpoint xxx.client has requested heartbeat message but hasn't responded in time. Closing connection.
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: API client disconnected for identity xxx.client
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: Client for endpoint xxx.client has requested heartbeat message but hasn't responded in time. Closing connection.
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: API client disconnected for identity xxx.client
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: Client for endpoint xxx.client has requested heartbeat message but hasn't responded in time. Closing connection.
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: API client disconnected for identity xxx.client
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: Client for endpoint xxx.client has requested heartbeat message but hasn't responded in time. Closing connection.
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: API client disconnected for identity xxx.client
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: Client for endpoint xxx.client has requested heartbeat message but hasn't responded in time. Closing connection.
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: API client disconnected for identity xxx.client
[2019-10-09 13:45:47 +0000] critical/Application: Icinga 2 has terminated unexpectedly. Additional information can be found in '/var/log/icinga2/crash/report.1570628747.543626'

@mattrose
Copy link
Author

mattrose commented Oct 9, 2019

This seems to be new code that was written for 2.11 Could it be causing the crash?

ad28380

@mattrose
Copy link
Author

mattrose commented Oct 9, 2019

It looks like network problems might be causing it to crash?

Just crashed again, debug log output

[2019-10-09 14:47:25 +0000] information/JsonRpcConnection: No messages for identity 'xxx.client' have been received in the last 60 seconds.
[2019-10-09 14:47:25 +0000] warning/JsonRpcConnection: API client disconnected for identity 'xxx.client'
[2019-10-09 14:47:25 +0000] warning/ApiListener: Removing API client for endpoint 'xxx.client'. 0 API clients left.
[2019-10-09 14:47:25 +0000] debug/EndpointDbObject: update is_connected=0 for endpoint 'xxx.client'
[2019-10-09 14:47:25 +0000] information/JsonRpcConnection: No messages for identity 'xxx.client' have been received in the last 60 seconds.
[2019-10-09 14:47:25 +0000] warning/JsonRpcConnection: API client disconnected for identity 'xxx.client'
[2019-10-09 14:47:25 +0000] warning/ApiListener: Removing API client for endpoint 'xxx.client'. 0 API clients left.
[2019-10-09 14:47:25 +0000] debug/EndpointDbObject: update is_connected=0 for endpoint 'xxx.client'
[2019-10-09 14:47:25 +0000] information/JsonRpcConnection: No messages for identity 'xxx.client' have been received in the last 60 seconds.
[2019-10-09 14:47:25 +0000] warning/JsonRpcConnection: API client disconnected for identity 'xxx.client'
[2019-10-09 14:47:25 +0000] warning/ApiListener: Removing API client for endpoint 'xxx.client'. 0 API clients left.
[2019-10-09 14:47:25 +0000] debug/EndpointDbObject: update is_connected=0 for endpoint 'xxx.client'
[2019-10-09 14:47:25 +0000] information/JsonRpcConnection: No messages for identity 'xxx.client' have been received in the last 60 seconds.
[2019-10-09 14:47:25 +0000] warning/JsonRpcConnection: API client disconnected for identity 'xxx.client'
[2019-10-09 14:47:25 +0000] warning/ApiListener: Removing API client for endpoint 'xxx.client'. 0 API clients left.
[2019-10-09 14:47:25 +0000] debug/EndpointDbObject: update is_connected=0 for endpoint 'xxx.client'
[2019-10-09 14:47:25 +0000] information/JsonRpcConnection: No messages for identity 'xxx.client' have been received in the last 60 seconds.
[2019-10-09 14:47:25 +0000] warning/JsonRpcConnection: API client disconnected for identity 'xxx.client'
[2019-10-09 14:47:25 +0000] warning/ApiListener: Removing API client for endpoint 'xxx.client'. 0 API clients left.
[2019-10-09 14:47:25 +0000] debug/EndpointDbObject: update is_connected=0 for endpoint 'xxx.client'
[2019-10-09 14:47:25 +0000] information/JsonRpcConnection: No messages for identity 'xxx.client' have been received in the last 60 seconds.
[2019-10-09 14:47:25 +0000] warning/JsonRpcConnection: API client disconnected for identity 'xxx.client'
[2019-10-09 14:47:25 +0000] warning/ApiListener: Removing API client for endpoint 'xxx.client'. 0 API clients left.
[2019-10-09 14:47:25 +0000] debug/EndpointDbObject: update is_connected=0 for endpoint 'xxx.client'
[2019-10-09 14:47:25 +0000] information/JsonRpcConnection: No messages for identity 'xxx.client' have been received in the last 60 seconds.
[2019-10-09 14:47:25 +0000] warning/JsonRpcConnection: API client disconnected for identity 'xxx.client'
[2019-10-09 14:47:25 +0000] warning/ApiListener: Removing API client for endpoint 'xxx.client'. 0 API clients left.
[2019-10-09 14:47:25 +0000] debug/EndpointDbObject: update is_connected=0 for endpoint 'xxx.client'
[2019-10-09 14:47:25 +0000] information/JsonRpcConnection: No messages for identity 'xxx.client' have been received in the last 60 seconds.
[2019-10-09 14:47:25 +0000] warning/JsonRpcConnection: API client disconnected for identity 'xxx.client'
[2019-10-09 14:47:25 +0000] warning/ApiListener: Removing API client for endpoint 'xxx.client'. 0 API clients left.
[2019-10-09 14:47:25 +0000] debug/EndpointDbObject: update is_connected=0 for endpoint 'xxx.client'
[2019-10-09 14:47:25 +0000] information/JsonRpcConnection: No messages for identity 'xxx.client' have been received in the last 60 seconds.
[2019-10-09 14:47:25 +0000] warning/JsonRpcConnection: API client disconnected for identity 'xxx.client'
[2019-10-09 14:47:25 +0000] warning/ApiListener: Removing API client for endpoint 'xxx.client'. 0 API clients left.
[2019-10-09 14:47:25 +0000] debug/EndpointDbObject: update is_connected=0 for endpoint 'xxx.client'
[2019-10-09 14:47:25 +0000] information/JsonRpcConnection: No messages for identity 'xxx.client' have been received in the last 60 seconds.
[2019-10-09 14:47:25 +0000] warning/JsonRpcConnection: API client disconnected for identity 'xxx.client'
[2019-10-09 14:47:25 +0000] warning/ApiListener: Removing API client for endpoint 'xxx.client'. 0 API clients left.
[2019-10-09 14:47:25 +0000] debug/EndpointDbObject: update is_connected=0 for endpoint 'xxx.client'
[2019-10-09 14:47:25 +0000] information/JsonRpcConnection: No messages for identity 'xxx.client' have been received in the last 60 seconds.
[2019-10-09 14:47:25 +0000] warning/JsonRpcConnection: API client disconnected for identity 'xxx.client'
[2019-10-09 14:47:25 +0000] warning/ApiListener: Removing API client for endpoint 'xxx.client'. 0 API clients left.
[2019-10-09 14:47:25 +0000] debug/EndpointDbObject: update is_connected=0 for endpoint 'xxx.client'
[2019-10-09 14:47:25 +0000] information/JsonRpcConnection: No messages for identity 'xxx.client' have been received in the last 60 seconds.
[2019-10-09 14:47:25 +0000] warning/JsonRpcConnection: API client disconnected for identity 'xxx.client'
[2019-10-09 14:47:25 +0000] warning/ApiListener: Removing API client for endpoint 'xxx.client'. 0 API clients left.
[2019-10-09 14:47:25 +0000] debug/EndpointDbObject: update is_connected=0 for endpoint 'xxx.client'
[2019-10-09 14:47:25 +0000] information/JsonRpcConnection: No messages for identity 'xxx.client' have been received in the last 60 seconds.
[2019-10-09 14:47:25 +0000] warning/JsonRpcConnection: API client disconnected for identity 'xxx.client'

@dnsmichi
Copy link
Contributor

It looks like network problems might be causing it to crash?

Interesting find. Can you provoke this e.g. with firewall rules?

The debug packages for Ubuntu are located in the icinga2-dbg package, https://icinga.com/docs/icinga2/latest/doc/21-development/#debug-requirements

@dnsmichi dnsmichi added the needs feedback We'll only proceed once we hear from you again label Oct 10, 2019
@mattrose
Copy link
Author

I'll see if I can provoke it using firewall rules. I tried installing the debug packages and I just get this error message

root@icinga-master.servers.fr:/var/log/icinga2/crash# apt-get install icinga2-dbg
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 icinga2-dbg : Depends: icinga2-bin (= 2.8.1-0ubuntu2) but 2.11.0-2.bionic is to be installed
               Depends: libicinga2 (= 2.8.1-0ubuntu2) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

@mattrose
Copy link
Author

I tried blocking all access to port 5665 from the master for a good half an hour today and I still couldn't get it to crash. Any ideas on how to install the debug packages?

@dnsmichi
Copy link
Contributor

Hm, the problem with the debug packages sounds a little like apt priorities or pinning. Do you have that in place for the icinga package repository? apt-cache showpkg icinga2-dbg should tell more about it.

@mattrose
Copy link
Author

Figured it out, I have a weird priority setting on that server that I can't track down but I managed to get it installed by specifying the version. I also installed gdb

New crash file is attached, but it doesn't seem to have much more information. Do I need to run icinga2 not from systemd to get the crash reporting to work properly?

As well, I noticed that the version reported in the crash report is 2.11.0-1 and the version reported by dpkg is 2.11.0-2, could this be causing a problem?

report.1570789064.592142.txt

@dnsmichi
Copy link
Contributor

dnsmichi commented Oct 14, 2019

I had to incorporate additional package changes when releasing 2.11 (human error, I forgot them), so all the Debian packages now say -2. Technically the source tarball is the same, the only difference here could be package related things. Still curious whether the crashes continue with 2.11.0-2 or the binary was not updated/restarted in this case, creating old crash reports again.

@mattrose
Copy link
Author

Ah, I thought that was the case. I made sure that the latest version of icinga2 was running. It gets restarted on a pretty regular basis, so it's definitely the latest version

@mattrose
Copy link
Author

Can you remove the needs-feedback tag? Or specify what further feedback is needed?

@jivuvtramvae
Copy link

Hi team,

I've experienced the same issue since dual master setup was upgraded from 2.10 to 2.11.2-1 (almost a week ago). My first instance crashes, and the second one goes down in a few minutes. My colleague said he noticed that the second instance once was running for quite a long time. The day before yesterday both instances were running smoothly for about 6 hours but then everything went worse, and now icinga2 crashes every half an hour.

Can't submit reproducing steps as they're unclear for me.

Attaching some information on my environment.
OS version:

root@icinga2-master00:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

Icinga2 version:

root@icinga2-master00:~# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.11.2-1)

Copyright (c) 2012-2019 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Ubuntu
  Platform version: 16.04.6 LTS (Xenial Xerus)
  Kernel: Linux
  Kernel version: 4.4.0-138-generic
  Architecture: x86_64

Build information:
  Compiler: GNU 5.4.0
  Build host: runner-LTrJQZ9N-project-298-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

Icinga2 features:

root@icinga2-master00:~# icinga2 feature list
Disabled features: compatlog debuglog elasticsearch gelf graphite influxdb livestatus notification opentsdb statusdata syslog
Enabled features: api checker command ido-mysql mainlog perfdata

Config validation:

[2019-11-01 13:09:52 +0000] information/cli: Icinga application loader (version: r2.11.2-1)
[2019-11-01 13:09:52 +0000] information/cli: Loading configuration file(s).
[2019-11-01 13:09:52 +0000] information/ConfigItem: Committing config item(s).
[2019-11-01 13:09:52 +0000] information/ApiListener: My API identity: icinga2-master00.srv.scp
[2019-11-01 13:09:56 +0000] warning/ApplyRule: Apply rule 'backup-downtime' (in /etc/icinga2/zones.d/master/downtimes.conf: 5:1-5:52) for type 'ScheduledDowntime' does not match anywhere!
[2019-11-01 13:09:56 +0000] warning/ApplyRule: Apply rule 'check_id' (in /etc/icinga2/zones.d/master/scp-services/system.conf: 126:1-126:24) for type 'Service' does not match anywhere!
<... a lot like that type 'Service' with a lot of sensitive data here...>
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 1 FileLogger.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 4 NotificationCommands.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 8695 Notifications.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 1 IcingaApplication.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 41 HostGroups.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 593 Hosts.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 1 EventCommand.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 31 Comments.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 1 PerfdataWriter.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 1 CheckerComponent.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 262 Zones.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 262 Endpoints.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 1 ApiUser.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 1 UserGroup.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 1 ApiListener.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 311 CheckCommands.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 3 TimePeriods.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 1 User.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 6039 Services.
[2019-11-01 13:09:56 +0000] information/ConfigItem: Instantiated 9 ServiceGroups.
[2019-11-01 13:09:56 +0000] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2019-11-01 13:09:56 +0000] information/cli: Finished validating the configuration file(s).

Stacktrace of all crashes looks pretty the same:

Stacktrace:

        (0) libc.so.6: gsignal (+0x38) [0x7f8670651428]
        (1) libc.so.6: abort (+0x16a) [0x7f867065302a]
        (2) libc.so.6: <unknown function> (+0x2dbd7) [0x7f8670649bd7]
        (3) libc.so.6: <unknown function> (+0x2dc82) [0x7f8670649c82]
        (4) icinga2: icinga::JsonRpcConnection::HandleAndWriteHeartbeats(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) (+0xf2c) [0x70197c]
        (5) /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2() [0x75c46f]
        (6) libboost_context.so.1.67.0: make_fcontext (+0x2f) [0x7f866e0875cf]

@dnsmichi dnsmichi changed the title Icinga 2.11.0 crashing every 30 minutes or so with error code 134 Icinga 2.11.x crashing every 30 minutes in JsonRpcConnection::HandleAndWriteHeartbeats() Nov 2, 2019
@dnsmichi dnsmichi added area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working blocker Blocks a release or needs immediate attention and removed needs feedback We'll only proceed once we hear from you again labels Nov 2, 2019
@dnsmichi
Copy link
Contributor

dnsmichi commented Nov 2, 2019

Another one for the list @lippserd

@jivuvtramvae
Copy link

Hi team,

Any updates on this? Do you maybe need any details/feedback/etc? We're really looking forward to this fix.

Many thanks.

@Al2Klimov
Copy link
Member

🚀 I did it! 🚀

● icinga2.service - Icinga host/service/network monitoring system
   Loaded: loaded (/lib/systemd/system/icinga2.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/icinga2.service.d
           └─limits.conf
   Active: failed (Result: exit-code) since Fri 2019-11-29 17:53:07 UTC; 2 days ago
  Process: 5542 ExecStartPre=/usr/lib/icinga2/prepare-dirs /etc/default/icinga2 (code=exited, status=0/SUCCESS)
  Process: 5552 ExecStart=/usr/sbin/icinga2 daemon --close-stdio -e ${ICINGA2_ERROR_LOG} (code=exited, status=139)
 Main PID: 5552 (code=exited, status=139)

Nov 28 13:50:34 aklimov-636691-master-0 icinga2[5552]: [2019-11-28 13:50:34 +0000] information/ConfigItem: Instantiated 1 ApiUser.
Nov 28 13:50:34 aklimov-636691-master-0 icinga2[5552]: [2019-11-28 13:50:34 +0000] information/ConfigItem: Instantiated 1 ApiListener.
Nov 28 13:50:34 aklimov-636691-master-0 icinga2[5552]: [2019-11-28 13:50:34 +0000] information/ConfigItem: Instantiated 235 CheckCommands.
Nov 28 13:50:34 aklimov-636691-master-0 icinga2[5552]: [2019-11-28 13:50:34 +0000] information/ConfigItem: Instantiated 1 User.
Nov 28 13:50:34 aklimov-636691-master-0 icinga2[5552]: [2019-11-28 13:50:34 +0000] information/ConfigItem: Instantiated 6040 Services.
Nov 28 13:50:34 aklimov-636691-master-0 icinga2[5552]: [2019-11-28 13:50:34 +0000] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
Nov 28 13:50:34 aklimov-636691-master-0 icinga2[5552]: [2019-11-28 13:50:34 +0000] information/cli: Closing console log.
Nov 28 13:50:34 aklimov-636691-master-0 systemd[1]: Started Icinga host/service/network monitoring system.
Nov 29 17:53:07 aklimov-636691-master-0 systemd[1]: icinga2.service: Main process exited, code=exited, status=139/n/a
Nov 29 17:53:07 aklimov-636691-master-0 systemd[1]: icinga2.service: Failed with result 'exit-code'.

139 - 128 = 11 i.e. SEGV

@lippserd @dnsmichi

@Al2Klimov
Copy link
Member

@dnsmichi I'd suggest to make a v2.12-rc1 w/o this issue being fixed.

@mattrose
Copy link
Author

mattrose commented Dec 2, 2019

Umm, yes, it's SIGSEGV, but that doesn't mean it's not a bug...

@dnsmichi
Copy link
Contributor

dnsmichi commented Dec 3, 2019

@Al2Klimov Since you've mentioned that notifications are now enabled, a possible indicator would be the new NotificationResult as cluster message merged in 2.11. This involves JSON encode/decode for nested dictionaries. Please look into this direction.

@Al2Klimov
Copy link
Member

For me it's not every 30m, but every day:

Terraform

resource "openstack_compute_instance_v2" "aklimov-636691-master" {
	count = 2
	name = "aklimov-636691-master-${count.index}"
	image_name = "Debian 9"
	flavor_name = "s1.xxlarge"
	network {
		name = "${var.tenant_network}"
		fixed_ip_v4 = "10.77.28.${count.index + 3}"
	}
	security_groups = [ "${openstack_compute_secgroup_v2.iliketrains.name}" ]
	key_pair = "${var.openstack_keypair}"
}

resource "openstack_compute_instance_v2" "aklimov-636691-agent-1" {
	count = 251
	name = "aklimov-636691-agent-${count.index}"
	image_name = "Debian 9"
	flavor_name = "s1.2310-custom-1-0.5-13"
	network {
		name = "${var.tenant_network}"
		fixed_ip_v4 = "10.77.28.${count.index + 5}"
	}
	security_groups = [ "${openstack_compute_secgroup_v2.iliketrains.name}" ]
	key_pair = "${var.openstack_keypair}"
}

resource "openstack_compute_instance_v2" "aklimov-636691-agent-2" {
	count = 49
	name = "aklimov-636691-agent-${count.index + 251}"
	image_name = "Debian 9"
	flavor_name = "s1.2310-custom-1-0.5-13"
	network {
		name = "${var.tenant_network}"
		fixed_ip_v4 = "10.77.29.${count.index}"
	}
	security_groups = [ "${openstack_compute_secgroup_v2.iliketrains.name}" ]
	key_pair = "${var.openstack_keypair}"
}

Ansible

---
- hosts: all
  become: yes
  become_method: sudo
  tasks:
  - apt_repository:
      filename: buster
      repo: |
        deb http://deb.debian.org/debian buster main contrib non-free
        deb-src http://deb.debian.org/debian buster main contrib non-free
        deb http://security.debian.org/ buster/updates main contrib non-free
        deb-src http://security.debian.org/ buster/updates main contrib non-free
        deb http://deb.debian.org/debian buster-updates main contrib non-free
        deb-src http://deb.debian.org/debian buster-updates main contrib non-free
  - apt:
      upgrade: dist
    register: upgrade
  - when: upgrade.changed
    reboot: {}
- hosts: icingas
  become: yes
  become_method: sudo
  tasks:
  - apt_key:
      url: https://packages.icinga.com/icinga.key
  - apt_repository:
      filename: icinga
      repo: >
        deb http://packages.icinga.com/{{ ansible_lsb.id |lower }}
        icinga-{{ ansible_lsb.codename }} main
  - loop:
    - icinga2-bin
    - monitoring-plugins
    apt:
      name: '{{ item }}'
- hosts: aklimov-636691-master-0
  become: yes
  become_method: sudo
  tasks:
  - shell: >
      icinga2 node setup
      --zone master
      --listen 0.0.0.0,5665
      --cn {{ inventory_hostname }}
      --master
      --disable-confd;
      rm -f /var/cache/icinga2/icinga2.vars
    args:
      creates: /var/lib/icinga2/certs/ca.crt
    notify: Restart Icinga 2
  - shell: icinga2 daemon -C
    args:
      creates: /var/cache/icinga2/icinga2.vars
  - with_inventory_hostnames:
    - 'icingas:!{{ inventory_hostname }}'
    shell: >
      icinga2 pki ticket --cn {{ item }}
      >/var/cache/icinga2/{{ item }}.ticket
    args:
      creates: '/var/cache/icinga2/{{ item }}.ticket'
  - with_inventory_hostnames:
    - 'icingas:!{{ inventory_hostname }}'
    fetch:
      dest: .tempfiles
      src: '/var/cache/icinga2/{{ item }}.ticket'
  - name: Fetch Icinga 2 master cert
    fetch:
      dest: .tempfiles
      src: '/var/lib/icinga2/certs/{{ inventory_hostname }}.crt'
  handlers:
  - name: Restart Icinga 2
    service:
      name: icinga2
      state: restarted
- hosts: 'icingas:!aklimov-636691-master-0'
  become: yes
  become_method: sudo
  tasks:
  - copy:
      dest: /var/cache/icinga2/trusted.crt
      owner: nagios
      group: nagios
      mode: '0644'
      src: .tempfiles/aklimov-636691-master-0/var/lib/icinga2/certs/aklimov-636691-master-0.crt
  - copy:
      dest: /var/cache/icinga2/my.ticket
      owner: nagios
      group: nagios
      mode: '0600'
      src: '.tempfiles/aklimov-636691-master-0/var/cache/icinga2/{{ inventory_hostname }}.ticket'
  - shell: >
      icinga2 node setup
      --zone {{ inventory_hostname }}
      --endpoint aklimov-636691-master-0,{{ hostvars['aklimov-636691-master-0'].ansible_all_ipv4_addresses[0] }},5665
      --parent_host {{ hostvars['aklimov-636691-master-0'].ansible_all_ipv4_addresses[0] }},5665
      --parent_zone master
      --listen 0.0.0.0,5665
      --ticket `cat /var/cache/icinga2/my.ticket`
      --trustedcert /var/cache/icinga2/trusted.crt
      --cn {{ inventory_hostname }}
      --accept-config
      --accept-commands
      --disable-confd
    args:
      creates: /var/lib/icinga2/certs/ca.crt
    notify: Restart Icinga 2
  handlers:
  - name: Restart Icinga 2
    service:
      name: icinga2
      state: restarted
- hosts: icingas
  become: yes
  become_method: sudo
  tasks:
  - file:
      path: /etc/icinga2/zones.conf.d
      owner: root
      group: root
      mode: '0755'
      state: directory
  - loop:
    - aklimov-636691-master-0
    - aklimov-636691-master-1
    copy:
      dest: '/etc/icinga2/zones.conf.d/{{ item }}.conf'
      owner: root
      group: root
      mode: '0644'
      content: |
        object Endpoint "{{ item }}" {
          host = "{{ hostvars[item].ansible_all_ipv4_addresses[0] }}"
        }
    notify: Restart Icinga 2
  - copy:
      dest: /etc/icinga2/zones.conf.d/master.conf
      owner: root
      group: root
      mode: '0644'
      content: |
        object Zone "master" {
          endpoints = [ "aklimov-636691-master-0", "aklimov-636691-master-1" ]
        }
    notify: Restart Icinga 2
  - copy:
      dest: /etc/icinga2/zones.conf.d/global.conf
      owner: root
      group: root
      mode: '0644'
      content: |
        object Zone "global" {
          global = true
        }
    notify: Restart Icinga 2
  handlers:
  - name: Restart Icinga 2
    service:
      name: icinga2
      state: restarted
- hosts: 'aklimov-636691-master-0:aklimov-636691-master-1'
  become: yes
  become_method: sudo
  tasks:
  - with_inventory_hostnames: 'icingas:!aklimov-636691-master-0:!aklimov-636691-master-1'
    copy:
      dest: '/etc/icinga2/zones.conf.d/{{ item }}.conf'
      owner: root
      group: root
      mode: '0644'
      content: |
        object Endpoint "{{ item }}" {
          host = "{{ hostvars[item].ansible_all_ipv4_addresses[0] }}"
        }
        object Zone "{{ item }}" {
          parent = "master"
          endpoints = [ "{{ item }}" ]
        }
    notify: Restart Icinga 2
  handlers:
  - name: Restart Icinga 2
    service:
      name: icinga2
      state: restarted
- hosts: 'icingas:!aklimov-636691-master-0:!aklimov-636691-master-1'
  become: yes
  become_method: sudo
  tasks:
  - copy:
      dest: '/etc/icinga2/zones.conf.d/{{ inventory_hostname }}.conf'
      owner: root
      group: root
      mode: '0644'
      content: |
        object Endpoint "{{ inventory_hostname }}" {
        }
        object Zone "{{ inventory_hostname }}" {
          parent = "master"
          endpoints = [ "{{ inventory_hostname }}" ]
        }
    notify: Restart Icinga 2
  handlers:
  - name: Restart Icinga 2
    service:
      name: icinga2
      state: restarted
- hosts: icingas
  become: yes
  become_method: sudo
  tasks:
  - copy:
      dest: /etc/icinga2/zones.conf
      content: 'include "zones.conf.d/*.conf"'
    notify: Restart Icinga 2
  handlers:
  - name: Restart Icinga 2
    service:
      name: icinga2
      state: restarted
- hosts: aklimov-636691-master-0
  become: yes
  become_method: sudo
  tasks:
  - loop:
    - global
    - master
    file:
      path: '/etc/icinga2/zones.d/{{ item }}'
      owner: root
      group: root
      mode: '0755'
      state: directory
  - with_inventory_hostnames: 'icingas:!aklimov-636691-master-0:!aklimov-636691-master-1'
    file:
      path: '/etc/icinga2/zones.d/{{ item }}'
      owner: root
      group: root
      mode: '0755'
      state: directory
  - copy:
      dest: '/etc/icinga2/zones.d/global/global.conf'
      owner: root
      group: root
      mode: '0644'
      content: |
        object User "navalny" {
        }

        object NotificationCommand "stabilnost" {
          command = [ "/bin/true" ]
        }

        for (i in range(200)) {
          apply Service i {
            check_command = "dummy"
            command_endpoint = host.name
            check_interval = 5m
            max_check_attempts = 1

            var that = this
            vars.dummy_state = function() use(that) {
              return if (that.last_check_result && that.last_check_result.state) { 0 } else { 2 }
            }

            assign where true
          }
        }

        apply Notification "stabilnost" to Service {
          command = "stabilnost"
          users = [ "navalny" ]
          assign where true
        }
    notify: Restart Icinga 2
  - loop:
    - aklimov-636691-master-0
    - aklimov-636691-master-1
    copy:
      dest: '/etc/icinga2/zones.d/master/{{ item }}.conf'
      owner: root
      group: root
      mode: '0644'
      content: |
        object Host "{{ item }}" {
          check_command = "passive"
          enable_active_checks = false
        }
    notify: Restart Icinga 2
  - with_inventory_hostnames: 'icingas:!aklimov-636691-master-0:!aklimov-636691-master-1'
    copy:
      dest: '/etc/icinga2/zones.d/{{ item }}/{{ item }}.conf'
      owner: root
      group: root
      mode: '0644'
      content: |
        object Host "{{ item }}" {
          check_command = "passive"
          enable_active_checks = false
        }
    notify: Restart Icinga 2
  handlers:
  - name: Restart Icinga 2
    service:
      name: icinga2
      state: restarted

@mattrose
Copy link
Author

mattrose commented Dec 3, 2019

For me it ended up being kind of random. Sometimes the server would stay up for a few hours at a time, sometimes it would go down in 20 minutes or so.

@dnsmichi
Copy link
Contributor

dnsmichi commented Dec 3, 2019

@mattrose In case you are able to measure that, how many notifications are sent in this period of time? Are there any failing notification scripts involved?

Besides, a different question - do you have many agents which still request a certificate and are steadily reconnecting?

@mattrose
Copy link
Author

mattrose commented Dec 3, 2019

On the first question, I can't answer that question immediately. Right now I have systemd just set to restart the icinga daemon automatically when it exits on SEGV.

On the second, I shouldn't have any agents that don't have a certificate, as the certificate exchanged is done through salt (http://www.saltstack.com)

Is there any way to tell through icinga logs if this is happening?

@lippserd
Copy link
Member

lippserd commented Dec 6, 2019

Hi,

Thanks for your information so far. We do think that this is related to #7532. I'll close this one as duplicate. If anything further comes up, please do not hesitate to participate in #7532.

And @mattrose to answer your question: Yes, the Icinga logs indicate certificate requests and failing connections, e.g. loads of Reconnecting to endpoint ... and Certificate validation failed for endpoint ...

Best,
Eric

@lippserd lippserd closed this as completed Dec 6, 2019
@lippserd lippserd added the duplicate This issue or pull request already exists label Dec 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/distributed Distributed monitoring (master, satellites, clients) blocker Blocks a release or needs immediate attention bug Something isn't working duplicate This issue or pull request already exists
Projects
None yet
Development

No branches or pull requests

6 participants